Description

BACKGROUND OF THE INVENTION 

Field of the Invention

[0001] This invention relates to development of novel DNA-binding proteins and polypeptides by an iterative process of mutation, expression, selection, and amplification. The ability to create novel DNA-binding proteins will have far-reaching applications, including, but not limited to, use in: a) treating viral diseases, b) treating genetic diseases, c) preparation of novel biochemical reagents, and d) biotechnology to regulate gene expression in cell cultures. Several workers have shown that repressors derived from bacteria function when expressed in eukaryotic cells (BREN84, FIGG88, BROW87, HUMC87, HUMC88), but none have shown how to generate proteins that bind sequence-specifically to a predetermined DNA sequence. For reviews of transcriptional control in eukaryotic cells, see STRU87, JONE87, and MANI87. The present application deals only with sequence-specific DNA-binding proteins, abbreviated DBP.

[0002] Proteins, particularly repressors, having affinity for specific sites on DNA modulate transcription of genes. The best known are a group of proteins primarily studied in prokaryotes that contain the structural motif alpha-helix-turn-alpha-helix (H-T-H) (SAUE82, PABO84). These proteins bind as dimers or tetramers to DNA at specific operator sequences that are approximately palindromic. Contacts made by two adjacent alpha helices of each monomer in and around two sites in the major groove of B-form DNA are a major feature in the DNA-protein interface. This group of proteins includes phage repressor and Cro proteins, bacterial metabolic repressors such as GalR, LacI, LexA, and TrpR, bacterial activator protein CAP and activator/repressor AraC, bacterial transposon and plasmid TetR proteins (PABO84), the yeast mating type regulators MATa1 and MATalpha2 (MILL85) and eukaryotic homeo box proteins (EVAN88).

[0003] Interactions between dimeric repressors and approximately palindromic operators have usually been discussed in the literature with attention focused on one half of the operator, with the tacit or explicit assumption that identical interactions occur in each half of the complex. Departures from palindromic symmetry allow proteins to distinguish among multiple related operators (SADL83, SIMO84). One must view the DNA-protein interface as a whole. The emphasis in the literature on dyad symmetry is a barrier to determining the requirements for general novel recognition of DNA by proteins.

[0004] The equilibrium geometry and flexibility of DNA are determined by the sequence; see inter alia HOGA87, GART88, and ULAN87. The interactions of ionic, polar, and hydrophobic groups on the DNA with solvent molecules and ions make detailed predictions of DNA conformation and binding properties very difficult; cf. OHLE85, ULAN87, and OTWI88.

[0005] Matthews (MATT88), commenting on the current collection of protein-DNA structures, concludes that: a) different H-T-H DBPs use their recognition helices differently, b) there is no simple code that relates particular base pairs to particular amino acids at specific locations in the DBP, and c) "full appreciation of the complexity and individuality of each complex will be discouraging to anyone hoping to find simple answers to the recognition problem." Schleif (SCHL88) has characterized the study of DNA-binding proteins as a field still in its infancy and emphasizes the difficulties of designing proteins that bind predetermined sequences.

[0006] Prokaryotic repressors exist that are unrelated to H-T-H binding proteins. Some of these bind to approximately palindromic sequences (e.g., Salmonella typhimurium phage P22 Mnt protein (VERS87a) and E. coli TyrR repressor protein (DEFE86)). Others bind to operator sequences that are partially symmetric (S. typhimurium phage P22 Arc protein, VERS87b; E. coli Fur protein, DELO87; plasmid R6K pi protein, FILU85) or non-symmetric (phage Mu repressor, KRAU86).

[0007] Genetics has enabled extensive analysis of prokaryotic DNA-binding proteins and their specific nucleic acid 
recognition sequences. It is not yet possible, however, to design a protein to bind strongly and specifically to an arbitrary 
DNA sequence. As taught by the present invention it is, nonetheless, possible to postulate a family of potential DBP 
mutants and identify one having the desired specificity by other means. 

[0008] Genetic studies of the DNA-binding proteins show that mutations in protein sequence that result in decrease of protein function fall into two overlapping classes: 1) those that destabilize the global protein structure or folding and 2) those that specifically alter the binding properties. The first class illuminates the general problem of protein folding and stability, while the second defines the interactions involved in the formation and stabilization of the protein-DNA complex. Mutations in the operator yield additional information.

[0009] Positions 84 to 91 in helix 5 of λ repressor have been subjected to extensive amino acid substitutions (REID88). Two or three positions were varied simultaneously through all twenty amino acids and those combinations giving normal function were selected. The authors neither discussed optimization of the number or positions of residues to vary to obtain any particular functionality, nor attempted to obtain proteins having alternate dimerization or recognition functions.

[0010] Pakula et al. (PAKU86) have randomly mutagenized λ Cro. They sought and found non-functional mutants but did not seek or find proteins having novel DNA-binding properties, nor did they suggest how to select such proteins.
[0011] Sequence-independent DNA-protein interactions are thought to occur via electrostatic interactions between the backbone of the DNA and charged or polar groups of the protein (ANDE87, LEWI83, and TAKE85). Sequence-specific interactions involve H-bonding, nonpolar, or van der Waals contacts between exposed side groups or groups of the polypeptide main chain and base pair edges exposed in the major and minor grooves of the DNA.
[0012] Mutations that alter residues involved in specific binding interactions with DNA have been identified in prokaryotic DBPs, including λ, 434, and P22 repressor and Cro proteins, P22 Arc and Mnt, and E. coli trp and lac repressors [...] protein-DNA complex.

[0013] A few cases have been reported (BASS88, YOUD83, VERS85a, CARU87, WHAR85b, WHAR87, EBRI84, and SPIR88) in which a change in one or a few residues in a DNA-binding protein not only abolishes binding by the protein to the wild-type operator but also confers strong binding to a different operator. In all the cited publications, alteration of binding specificity has been accomplished by using symmetrically-located pairs of alterations in the operator sites. Single asymmetric changes or multiple changes asymmetrically located in either the binding protein or its operator were not considered. In "helix swap" experiments (WHAR84, WHAR85b, WHAR85a, SPIR88, BUSH88, PABO84), multiple mutations are introduced into the DNA-binding recognition helix of H-T-H proteins with the goal of changing the operator specificity of one known DBP to that of a different known DBP.

[0014] An extension of the "helix swap" experiments uses a mixture of 434 repressor and 434R[alpha3(P22R)] (HOLL88). This mixture recognizes and binds in vitro with high affinity to a 16 bp chimeric operator comprising a 434 half-site and a P22 half-site, indicating that active heterodimers are formed. The authors did not extend the results to intracellular repression, nor did they perform mutagenesis of the repressors and selection of cells to create novel recognition patterns.

[0015] Two approaches have been developed to create novel proteins through reverse genetics. In one approach, dubbed "protein surgery" (DILL87), a substitution is introduced at a single protein residue. This approach has been used to determine the effects on structure and function of specific substitutions in trypsin (CRAI85, RAOS87, BASH87).
[0016] The other approach has been to generate a variety of mutants at many loci within the cloned gene, the "gene-directed random mutagenesis" method. The specific location and nature of the change or changes are determined post hoc by DNA sequencing. If loss of a wild-type function confers a cellular phenotype, one screens colonies for mutations; see, e.g., PAKU86. This approach is limited by the number of colonies that can be examined. An additional important limitation is that many desirable protein alterations require multiple amino acid substitutions and thus are not accessible through single base changes or even through all possible amino acid substitutions at any one residue.
[0017] The objective in both these approaches has been, however, to analyze the effects of a variety of point mutations, so that rules governing such substitutions could be developed (ULME83). Progress has been hampered by the effort involved in using either method (ROBE86).

[0018] Oliphant et al. (OLIP86) and Oliphant and Struhl (OLIP87) have demonstrated ligation and cloning of highly degenerate oligonucleotides and have applied saturation mutagenesis to the study of promoter sequence and function. They have suggested that similar methods could be used to study genetic expression of protein coding regions of genes, but they do not say how one should: a) choose protein residues to vary, or b) select or screen mutants with desirable properties.

[0019] Ward et al. (WARD86) have engineered heterodimers from homodimers of tyrosyl-tRNA synthetase. Methods of converting homodimeric DBPs into heterodimeric DBPs are disclosed in the present invention. Methods of deriving single-polypeptide pseudo-dimeric DBPs from homodimeric DBPs are disclosed in the examples of the present invention.

[0020] Benson et al. (BENS86) have developed a scheme to detect genes for sequence-specific DNA-binding proteins. They do not consider non-symmetric target DNA sequences, nor do they suggest mutagenesis to generate novel DNA-binding properties. Their method is presented as a method to detect genes for naturally occurring DNA-binding proteins. Because the selective system is lytic growth of phage, low levels of repression cannot be detected. Selective chemicals, as disclosed in the present application, on the other hand, can be finely modulated so that low-level repression is detectable.

[0021] Elledge and Davis (ELLE89a) and Elledge et al. (ELLE89b) have used an occluded aadA gene in a selection for cells expressing eukaryotic DBPs. The supposed recognition sequence of the sought DBP is incorporated into the strong promoter that occludes aadA on a low-copy number plasmid. Their system is presented as a tool for cloning pre-existing DBPs and there is no mention of variegation of the gene that encodes the potential DBP. Furthermore, there is no discussion of the symmetry of the target sequence or of the symmetry of the DBP.

[0022] Ladner and Bird, WO88/06601, suggest strategies for the preparation of asymmetric repressors. In one embodiment, a gene is constructed that encodes, as a single polypeptide chain, the two DNA-binding domains of a naturally-occurring dimeric repressor, joined by a polypeptide linker that holds the two binding domains in the necessary spatial relationship for binding to an operator. While they prefer to design the linker based on protein structural data (cf. Ladner, U.S. Patent 4,704,692), they state that uncertainties in the design of the linker may be resolved by generating a family of synthetic genes, differing in the linker-encoding subsequence, and selecting in vivo for a gene encoding the desired pseudo-dimer. Ladner and Bird do not consider the background of false positives that would arise if the two-domain polypeptides dimerize to form pseudo-tetramers.

EP 0 452 413 B1

[0023] The binding of lambdoid repressors, Cro and CI repressor, is taken, in WO88/06601, as canonical even though other DBPs were known having operators of different lengths. WO88/06601 maintains that the 17 bp lambdoid operators can be divided into three regions: a) a left arm of five bases, b) a central region of seven bases, and c) a right arm of five bases. Several other DBPs are known for which this division is inappropriate. Further, WO88/06601 states that the sequence and composition of the central region, in which edges of bases are not contacted by the DBP, are immaterial. There is direct evidence for 434 repressor (KOUD87, KOUD88) that the sequence and composition of the central region strongly influence binding of 434 repressor.

[0024] Once a pseudo-dimer is obtained, they then obtain an asymmetric pseudo-dimer by the following technique. First, the user of WO88/06601 is directed to construct a family of hybrid operators in which the sequences of the left and right arms are specified; no specification is given for the central seven bases. In each member of the family, the left arm contains the same sequence as the wild-type operator left arm while the right arm 5-mer is systematically varied through all 1024 possibilities. Similarly, in the gene encoding the pseudodimer, the codons for one recognition helix have the wild-type sequence while the codons coding for the other recognition helix are highly varied. The variegated pseudodimer genes are expressed in bacterial cells, wherein the hybrid operators are positioned to repress a single highly deleterious gene. Thus, it is supposed that one can identify a recognition helix for each possible 5-mer right arm of the operator by in vivo selection; the correspondences between 5-mer right arms and sequences of recognition helices are compiled into a dictionary. The consequences of mutations or deletions in the deleterious genes are not considered. WO88/06601 suggests that successful constructions may be very rare, e.g., one in 10^[...], but ignores other genetic events of similar or greater frequency.
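
The figure of 1024 possibilities follows from four bases at each of five positions (4^5 = 1024). A minimal sketch of that enumeration is given below; the function name and constants are illustrative assumptions and do not appear in WO88/06601.

```python
from itertools import product

BASES = "ACGT"
ARM_LENGTH = 5  # one operator arm in the 5-7-5 division of a 17 bp operator


def all_arms(length=ARM_LENGTH):
    """Return every DNA sequence of the given length, in lexicographic order."""
    return ["".join(p) for p in product(BASES, repeat=length)]


arms = all_arms()
print(len(arms))  # 4**5 = 1024
print(arms[:3])
```

Each of these 1024 right arms would require its own hybrid operator, which gives a sense of the scale of the dictionary-building exercise criticized here.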

[0025] To obtain a repressor for an arbitrary 17-mer operator, the user of WO88/06601:

a) finds the 5-mer sequence of the left arm in the dictionary and uses the corresponding recognition helix sequence in the first DNA-binding domain of the pseudodimer,

b) ignores the sequence and composition of the next seven bases, and

c) finds the 5-mer sequence of the right arm in the dictionary and uses the corresponding recognition helix sequence in the second DNA-binding domain of the pseudodimer.
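
Steps a) to c) amount to a table lookup, which can be sketched as follows. This is only an illustration of the procedure attributed to WO88/06601: the dictionary entries, helix labels, and function names are hypothetical placeholders, and the sketch ignores strand orientation of the right arm, which the text above does not specify.

```python
def split_operator(operator):
    """Split a 17-base operator into (left arm, central region, right arm)."""
    if len(operator) != 17:
        raise ValueError("expected a 17-base operator")
    return operator[:5], operator[5:12], operator[12:]


def design_pseudodimer(operator, helix_dictionary):
    """Look up one recognition helix per arm; the central 7 bases are ignored."""
    left, _central, right = split_operator(operator)  # step b: central region unused
    first_domain_helix = helix_dictionary[left]       # step a
    second_domain_helix = helix_dictionary[right]     # step c
    return first_domain_helix, second_domain_helix


# Hypothetical dictionary entries; real entries would come from in vivo selection.
toy_dictionary = {"ACGTA": "HELIX_1", "TTGCA": "HELIX_2"}
print(design_pseudodimer("ACGTA" + "GGGGGGG" + "TTGCA", toy_dictionary))
```

The sketch makes the criticized assumption explicit: the central region never enters the lookup, although the evidence cited for 434 repressor indicates that it influences binding.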



[0026] WO88/06601 also envisions means for producing a heterodimeric repressor. A plasmid is provided that carries genes encoding two different repressors. A population of such plasmids is generated in which some codons are varied in each gene. WO88/06601 instructs the user to introduce very high levels of variegation without regard to the number of independent transformants that can be produced. WO88/06601 also instructs the user to introduce variegation at widely separated sites in the gene, though there is no teaching concerning ways to simultaneously introduce high levels of variegation at widely separated sites in the gene or concerning maintenance of diversity without selective pressure, as would be needed if the variegation were introduced stepwise. WO88/06601 teaches that codons thought to be involved in the protein-protein interface should be preferentially mutated to generate heterodimers. Cells transformed with this population of plasmids will produce both the desired heterodimer and the two "wild-type" homodimers. WO88/06601 advises that one select for production of the heterodimer by providing a highly deleterious gene controlled by a hybrid operator, and beneficial genes controlled by the wild-type operators. The fastest growing cells, it is taught, will be those that produce a great deal of the heterodimer (which blocks expression of the deleterious gene) and little of the homodimers (so that the beneficial genes are more fully expressed). There is no consideration of mutations or deletions in the deleterious gene or in the wild-type operators; such mutations will produce a background of fast-growing cells that do not contain the desired heterodimers.



SUMMARY OF THE INVENTION 

[0027] This invention relates to the development of novel proteins or polypeptides that preferentially bind to a specific subsequence of double-stranded DNA (the "target") using a novel scheme for in vivo selection of mutant proteins exhibiting the desired binding specificities.

[0028] The invention relates in particular to a selection vector for selecting recipient cells transformed by such vector that express a protein or polypeptide that binds specifically to a predetermined target DNA sequence borne by said vector, which comprises a first and a second operon, each comprising at least one expressible gene, the genes of said first and second operon being different, a copy of the target DNA sequence being included in each operon and positioned therein so that the recipient cells enjoy a selective advantage, other than resistance to lytic growth of phage, if they express a protein or polypeptide which binds to said copies of the target DNA sequence.

[0029] The novel binding proteins or polypeptides may be obtained by mutating a gene encoding on expression: 1) a known DNA-binding protein within the subsequence encoding a known DNA-binding domain, 2) a protein that, while not possessing a known DNA-binding activity, possesses a secondary or higher order structure that lends itself to binding activity (clefts, grooves, helices, etc.), 3) a known DNA-binding protein but not in the subsequence known to cause the binding, or 4) a polypeptide having no known 3D structure of its own.

[0030] This application uses the term "variegated DNA" to refer to a population of molecules that have the same base sequence through most of their length, but that vary at a number of defined loci. Using standard genetic engineering techniques, variegated DNA can be introduced into a plasmid so that it constitutes part of a gene (OLIP86, OLIP87, CHEN88, AUSU87, REID88). When plasmids containing variegated DNA are used to transform bacteria, each cell makes a version of the original protein. Each colony of bacteria produces a different version from most other colonies. If the variegations of the DNA are concentrated at loci that code on expression for residues known to be on the surface of the protein or in loops, a population of genes will be generated that code on expression for a population of proteins, many members of which will fold into roughly the same 3D structure as the parental protein. Most often we generate mutations that are concentrated within codons for residues thought to make contact with the DNA. Secondarily, we introduce mutations into codons specifying residues that are not directly involved in DNA contact but that affect the position or dynamics of residues that do contact the DNA.
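
The definition of variegated DNA can be illustrated with a short sketch that enumerates a population sharing one parental sequence except at defined loci. The parental sequence, the loci, and the allowed bases below are hypothetical values chosen only for illustration.

```python
from itertools import product


def variegate(parental, variable_loci):
    """Return all sequences equal to `parental` except at the given loci.

    `variable_loci` maps a 0-based position to the string of bases allowed there.
    """
    positions = sorted(variable_loci)
    population = []
    for choice in product(*(variable_loci[p] for p in positions)):
        seq = list(parental)
        for pos, base in zip(positions, choice):
            seq[pos] = base
        population.append("".join(seq))
    return population


parental = "ATGGCTAAA"  # hypothetical parental sequence
population = variegate(parental, {3: "ACGT", 4: "ACGT", 5: "GT"})
print(len(population))  # 4 * 4 * 2 = 32
print(population[:4])
```

The population size is the product of the number of alternatives at each varied locus, so it grows quickly as loci are added; this is why the number of independent transformants that can be produced limits how much variegation is useful.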
[0031] In general, a variegated population of DNA molecules, each of which encodes one of a large (e.g., 10^[...]) number of distinct potential target-binding proteins, is used to transform a cell culture. The cells of this cell culture are engineered with binding marker genetic elements so that, under selective conditions, the cell thrives only if the expressed potential target-binding protein in fact binds to the target subsequence, preventing transcription of these binding marker genetic elements. (Typically, binding of a successful target-binding protein to the target subsequence blocks expression of a gene product that is deleterious under selective conditions. Alternatively, binding of a successful target-binding protein can inactivate a strong promoter that otherwise occludes transcription of a beneficial gene.) The mutant cells are directed to express the potential target-binding proteins and the selective conditions are applied. Cells expressing proteins binding successfully to the target are thus identified by in vivo selection. If the binding characteristics are not fully satisfactory, the amino acid sequences of the best binding proteins are determined (usually by sequencing the corresponding genes), a new population of DNA molecules is synthesized that encode variegated forms of the best binding proteins of the last cull, mutant cells are prepared, the new population of potential DNA-binding proteins is expressed, and the best proteins are once again identified by the superior growth of the corresponding transformants under selective conditions. The process is repeated until a protein or polypeptide with the desired binding characteristics is obtained. Its corresponding gene may then be moved to a suitable expression system.
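
The iterative cycle of variegation, expression, selection, and re-variegation can be summarized as a loop. The following is a toy simulation under stated assumptions, not the method of the invention: the scoring function is a hypothetical stand-in for in vivo selection, and the library size, number of loci varied per round, and sequences are arbitrary illustrative values.

```python
import random

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the twenty amino acids, one-letter codes


def variegate(parent, loci, rng, size):
    """Make `size` variants of `parent`, randomizing only the chosen loci."""
    variants = []
    for _ in range(size):
        residues = list(parent)
        for i in loci:
            residues[i] = rng.choice(ALPHABET)
        variants.append("".join(residues))
    return variants


def toy_binding_score(protein, ideal):
    """Stand-in for selection: count positions that match a hidden ideal."""
    return sum(a == b for a, b in zip(protein, ideal))


def forced_evolution(parent, ideal, loci_per_round=2, library_size=500,
                     max_rounds=50, seed=0):
    """Repeat variegate -> select best -> re-variegate until satisfactory."""
    rng = random.Random(seed)
    best = parent
    for round_number in range(1, max_rounds + 1):
        if toy_binding_score(best, ideal) == len(ideal):
            return best, round_number - 1
        loci = rng.sample(range(len(parent)), loci_per_round)
        # The current best is kept in the library, so the score never decreases.
        library = variegate(best, loci, rng, library_size) + [best]
        best = max(library, key=lambda p: toy_binding_score(p, ideal))
    return best, max_rounds


best, rounds = forced_evolution("AAAAAA", "KRQSTV")
print(best, rounds)
```

Varying only a few loci per round keeps each library small enough to be sampled thoroughly, which mirrors the emphasis on choosing a limited set of residues to vary in each cycle.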

[0032] In the simplest form of this invention, the mutant cells are provided with a selectable genetic element, the transcription of which is deleterious to the survival or growth of the cell. The selectable genetic element either is a promoter or is operably linked to a promoter regulating the expression of the gene. The promoter, or other non-coding region of the genetic element (for example, an intron), has been modified to include the desired target subsequence in a position where it will not interfere with transcription of the selectable gene unless a protein binds to that target subsequence. Each mutant cell is also provided with a gene encoding on expression a potential DNA-binding protein, operably linked to a promoter that is preferably regulated by a chemical inducer. When this gene is expressed, the potential DNA-binding protein has the opportunity to bind to the target and thereby protect the cell from the selective conditions under which the product of the binding marker gene would otherwise harm the cell.

[0033] In addition to the desired outcome of these in vivo selections, there exist a number of possible genetic events that allow the cells to escape the selection, producing artifacts and inefficiency by allowing the growth of colonies that do not express the desired sequence-specific DNA-binding proteins. Examples of mechanisms, other than the desired outcome, that lead to cell survival under the selective conditions include: a) a point mutation or a deletion in the selectable gene eliminates expression or function of the selectable gene product; b) a host chromosomal mutation compensates for or suppresses function of the selectable gene product; c) the introduced potential DNA-binding protein binds to a DNA subsequence other than the chosen target subsequence and blocks expression of the selectable gene; d) the introduced potential DNA-binding protein binds to and inactivates the gene product of the selective gene; and e) a DNA-binding protein endogenous to the host mutates so that it binds to the selectable gene and blocks expression of the selectable gene.

[0034] This invention relates, in particular, to the design of a vector that confers upon the host cells the desired conditional sensitivity to the selection conditions in such a manner as to greatly reduce the likelihood of false positives and artifactual colonies.

[0035] First, at least two selectable genes that are functionally unrelated are used to reduce the risk that a single point mutation in the vector (or in the host chromosome) will destroy the sensitivity of the cell to the selective conditions, since it will eliminate only one of the two (or more) deleterious phenotypes. Similarly, a single introduced gene for a potential DNA-binding protein that binds to and inactivates the gene product of one selectable gene will not bind and inactivate the gene product of the other selectable gene. The likelihood that point mutations will occur in both selectable genes, or that two host chromosomal mutations will spontaneously arise that suppress the effects of two genes, is the product of the individual probabilities of the necessary events, and thus is extremely low.
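
The product rule invoked above can be made concrete with a small calculation. The per-gene escape frequency used here is a hypothetical round number chosen for illustration, not a measured value, and the calculation assumes the escape events are independent.

```python
def joint_escape_probability(per_gene_probabilities):
    """Probability that independent escape events occur for every selectable gene."""
    joint = 1.0
    for p in per_gene_probabilities:
        joint *= p
    return joint


single_gene_escape = 1e-6  # hypothetical frequency of an inactivating mutation

print(joint_escape_probability([single_gene_escape]))
print(joint_escape_probability([single_gene_escape, single_gene_escape]))
```

Under these assumptions, requiring two independent events lowers the background of escape mutants by roughly another factor of a million, which is the motivation for using two functionally unrelated selectable genes.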
[0036] The DNA sequences of the two or more selectable genes preferably should not have long segments of identity: a) to avoid isolation of a DBP that binds these identical regions instead of the intended target sequence, and b) to reduce the likelihood of genetic recombination. The degeneracy of the genetic code allows us to avoid exact identity of
[0037] Second, the selectable genes are placed on the vector in alternation with genetic elements that are essential to plasmid maintenance. Thus, a single deletion event, even of thousands of bases, cannot eliminate both selectable genes without also eliminating vital genetic elements. Alternatively, the selectable genes are placed in the bacterial chromosome. Spontaneous deletions from the chromosome are rare.

[0038] Third, different promoters are associated with each of the selectable genes. This ensures that the selection does not isolate cells harboring genes encoding on expression novel DNA-binding proteins that bind specifically to subsequences that are part of the promoter but not the chosen target subsequence. Each cell expresses only one or a few introduced potential DNA-binding proteins (multiple potential DNA-binding proteins could arise if one cell is transformed by two or more variegated plasmids). The probability that two such proteins will occur in one cell and that one will bind to the promoter of the first selectable gene and that the second will bind to the different promoter of the second selectable gene is very small.

[0039] Fourth, the selectable binding marker genes may be placed on a vector different from the vector that carries the potential dbp gene. DNA manipulations that introduce variegation into the potential dbp gene can cause mutations in the vector remote from the site of the intended mutations. Thus, we may place the selectable binding marker genes in the bacterial chromosome or on a separate plasmid that is compatible with the dbp vector.
[0040] Finally, the same promoter is used to initiate transcription of two genes: a) one of the deleterious selectable binding marker genes, and b) a beneficial or essential gene also borne on the plasmid and used to select for uptake and maintenance of the plasmid (e.g., an antibiotic resistance gene, such as bla). In the case of the beneficial or essential gene, however, there is no instance of the predetermined target DNA subsequence associated with the promoter. Thus, if a DNA-binding protein binds to a subsequence of the promoter other than the predetermined target DNA subsequence, it will also frustrate expression of the beneficial or essential gene. If desired, more than one such beneficial or essential gene may be provided. In that event, copies of promoter A may be operably linked to both deleterious gene A' (with an instance of the target) and beneficial gene A" (without an instance of the target), while copies of promoter B are operably linked to both deleterious gene B' (with target) and beneficial gene B" (without target).
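
The logic of pairing each promoter with one deleterious gene (carrying the target) and one beneficial gene (lacking it) can be expressed as a small truth-table model. This is a simplified sketch: the class, the site labels, and the all-or-nothing repression rule are illustrative assumptions rather than features stated in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    name: str
    promoter: str
    has_target: bool   # does this copy of the promoter carry the target subsequence?
    deleterious: bool  # True: must be repressed; False: must be expressed


def is_repressed(gene, bound_sites):
    """A gene is off if the protein binds its promoter or its copy of the target."""
    return gene.promoter in bound_sites or (gene.has_target and "target" in bound_sites)


def cell_survives(genes, bound_sites):
    """Survival needs every deleterious gene off and every beneficial gene on."""
    for gene in genes:
        off = is_repressed(gene, bound_sites)
        if gene.deleterious and not off:
            return False
        if not gene.deleterious and off:
            return False
    return True


VECTOR = [
    Gene("A'", "promoter_A", has_target=True, deleterious=True),
    Gene('A"', "promoter_A", has_target=False, deleterious=False),
    Gene("B'", "promoter_B", has_target=True, deleterious=True),
    Gene('B"', "promoter_B", has_target=False, deleterious=False),
]

print(cell_survives(VECTOR, {"target"}))      # True: only the target is bound
print(cell_survives(VECTOR, {"promoter_A"}))  # False: beneficial gene A" is also shut off
print(cell_survives(VECTOR, set()))           # False: deleterious genes are expressed
```

In this model only a protein that recognizes the target itself passes the selection; a protein that binds either promoter is counter-selected through the matching beneficial gene.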
[0041] The selection system described above is a powerful tool that eliminates most of the artifacts associated with selections based on cloning vectors that use a single selectable gene or that have all selectable genes in a contiguous region of the plasmid. While this invention embraces using the aforementioned elements of a selection system singly or in partial combination, most preferably all are employed.

[0042] In one embodiment the invention relates to a cell culture comprising a plurality of cells, each cell bearing: 

i) a gene coding on expression for a potential DNA-binding protein or polypeptide, where such protein or polypep- 
tide is not the same for all such cells, but rather varies at a limited number of amino acid positions; and 

ii) at least two independent operons, each comprising at least one binding marker gene coding on expression for a 
product conditionally deleterious to the survival or reproduction of such cells, the promoter of each said binding 
marker gene containing a predetermined target DNA subsequence so positioned that if said target DNA subse- 
quence is bound by a DNA-binding protein or polypeptide, said conditionally deleterious product is not expressed 
in functional form. 

[0043] Abolition of function is much easier than engineering of novel function. Reverse selection can isolate cells that: a) express no DBP, b) express unstable proteins descendant from a parental DBP, or c) express a protein descendant from a parental DBP having very nearly the same 3D structure as the parental DBP, but lacking the functionality of the parent. We are interested in this third class. It is difficult, however, to distinguish among these classes genetically. Therefore, when using reverse selection, we carefully choose sites to mutate the protein (so as to minimize the chances of destroying tertiary structure) and we introduce a lower level of variegation than in forward selection. We must verify biochemically that a stable, folded protein is produced by the isolated cells.

[0044] Another concept of the present invention is the use of a polypeptide, rather than a protein, to preferentially bind DNA. This polypeptide, instead of binding the DNA molecule as a preformed molecule having shape complementary to DNA, will wind about the DNA molecule in the major or minor groove. Such a polypeptide has the advantages that: a) it is smaller than a protein having equivalent recognizing ability and may be easier to introduce into cells, and b) it may serve as a model for creation of other compounds that bind DNA sequence-specifically.
[0045] In a preferred embodiment, transcription of the DNA that codes on expression for potential DNA-binding proteins or polypeptides is regulated by addition of a chemical inducer, such as isopropylthiogalactoside (IPTG), to the cell culture. Other regulatable promoters having different inducers or other means of regulation are also appropriate.
[0046] The invention encompasses the design and synthesis of variegated DNA encoding on expression a collection of closely related potential DNA-binding proteins or polypeptides characterized by constant and variable regions, said proteins or polypeptides being designed with a view toward obtaining a protein or polypeptide that binds a predetermined target DNA subsequence.

[0047] For the purposes of this invention, the term "potential DNA-binding polypeptide" refers to a polypeptide encoded by one species of DNA molecule in a population of variegated DNA wherein the region of variation appears in one or more subsequences encoding one or more segments of the polypeptide having the potential of serving as a DNA-binding domain for the target DNA sequence or having the potential to alter the position or dynamics of protein residues that contact the DNA. A "potential DNA-binding protein" (potential-DBP) may comprise one or more potential DNA-binding polypeptides. Potential-DBPs comprising two or more polypeptide chains may be homologous aggregates (e.g., A2) or heterologous aggregates (e.g., AB).

[0048] From time to time, it may be helpful to speak of the "parental sequence" of the variegated DNA. When the novel DNA-binding domain sought is a homolog of a known DNA-binding domain, the parental sequence is the sequence that encodes the known DNA-binding domain. The variegated DNA is identical with this parental sequence at most loci, but will diverge from it at chosen loci. When a potential DNA-binding domain is designed from first principles, the parental sequence is a sequence that encodes the amino acid sequence that has been predicted to form the desired DNA-binding domain, and the variegated DNA is a population of "daughter DNAs" that are related to that parent by a high degree of sequence similarity.

[0049] The fundamental principle of the invention is one of forced evolution. The efficiency of the forced evolution 
is greatly enhanced by careful choice of which residues are to be varied. The 3D structure of the potential DNA-binding 
domain and the 3D structure of the target DNA sequence are key determinants in this choice. First a set of residues 
that can either simultaneously contact the target DNA sequence or that can affect the orientation or flexibility of residues 
that can touch the target is identified. Then all or some of the codons encoding these residues are varied simultane- 
ously to produce a variegated population of DNA. The variegated population of DNA is introduced into cells so that a 
variegated population of cells producing various potential-DBPs is obtained.

[0050] The highly variegated population of cells containing genes encoding potential-DBPs is selected for cells 
containing genes that express proteins that bind to the target DNA sequence ("successful DNA-binding proteins"). After
one or more rounds of such selection, one or more of the chosen genes are examined and sequenced. If desired, new 
loci of variation are chosen. The selected daughter genes of one generation then become the parental sequences for 
the next generation of variegated DNA (vgDNA). 
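
The iterative cycle described above (vary chosen residues, select the best binder, use it as the next parent) can be illustrated with a toy simulation. This is only a sketch under loud assumptions: variation is applied directly to amino acids rather than to codons, and `toy_score` is a hypothetical stand-in for the genetic selection, which in the invention is performed in living cells and not by a scoring function. All names below are illustrative and do not come from the specification.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def variegate(parent, loci, n_variants, rng):
    """Return n_variants copies of parent with random residues at the chosen loci."""
    variants = []
    for _ in range(n_variants):
        seq = list(parent)
        for i in loci:
            seq[i] = rng.choice(AMINO_ACIDS)
        variants.append("".join(seq))
    return variants

def toy_score(seq, ideal):
    """Hypothetical proxy for binding: number of positions matching an 'ideal' binder."""
    return sum(a == b for a, b in zip(seq, ideal))

def evolve(parent, ideal, loci_per_round, n_variants=2000, seed=0):
    """Run one variegation step per entry of loci_per_round.

    The winner of each round becomes the parental sequence of the next round.
    The parent itself is kept in the population, so the score never decreases.
    """
    rng = random.Random(seed)
    for loci in loci_per_round:
        population = variegate(parent, loci, n_variants, rng) + [parent]
        parent = max(population, key=lambda s: toy_score(s, ideal))
    return parent
```

For example, `evolve("MKTAYIAK", "MRTWYLAK", [[1, 3], [5]])` varies residues 1 and 3 in the first round and residue 5 in the second, leaving all other positions identical to the starting sequence.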

[0051] DNA-binding proteins (DBPs) that bind specifically to viral DNA so that transcription is blocked will be useful
in treating viral diseases, either by introducing DBPs into cells or by introducing the gene coding on expression for the
DBP into cells and causing the gene to be expressed. In order to develop such DBPs, we need use only the nucleotide
sequence of the viral genes to be repressed. Once a DBP is developed, it is tested against virus in vivo. Use of several
independently-acting DBPs that all bind to one gene allows us: a) to repress the gene despite possible variation in the
sequence, and b) to focus repression on the target gene while distributing side effects over the entire genome of the
host cell. Animals, plants, fungi, and microbes can be genetically made intracellularly immune to viruses by introducing,
into the germ line, genes that code on expression for DBPs that bind DNA sequences found in viruses that infect the
animal (including human), plant, fungus, or microbe to be protected.

[0052] Sequence-specific DBPs may also be used to treat autoimmune and genetic disease either by repressing
noxious genes or by causing expression of beneficial genes. 

[0053] Some naturally-occurring DBPs bind sequence-specifically to DNA only in the presence or absence of specific
effector molecules. For example, Lac repressor does not bind the lac operator in the presence of lactose or
isopropylthiogalactoside (IPTG); Trp repressor binds DNA only in the presence of tryptophan or certain analogues of
tryptophan. The method of the present invention can be used to select mutants of such DBPs that a) recognize a different
cognate DNA sequence, or b) recognize a different effector molecule. These alterations would be useful because:
a) altering the DNA specificity of known inducible or de-repressible DBPs allows us to use the novel DBP without affecting
existing metabolic pathways, and b) having novel effectors allows us to induce or de-repress the regulated gene without
altering the state of genes that are controlled by the natural effectors. In addition, temperature-sensitive DBPs could be
made which would allow us to control gene expression in the same way that λ cI857 and PR and PL are used.

[0054] Conferring novel DNA-recognition properties on proteins will allow development of novel restriction enzymes
that recognize more base pairs and therefore cut DNA less frequently. For example, the methods of the present invention
will be useful in developing a derivative of EcoRI (recognition GAATTC) that recognizes and cleaves a longer
recognition site, such as TGAATTCA. Proteins that recognize specific DNA sequences may also be used to block the
action of known restriction enzymes at some subset of the recognition sites of the known enzyme, thereby conferring
greater specificity on that enzyme. Other DNA-binding enzymes may also be obtained by the methods described
herein.

[0055] The methods of the present invention are primarily designed to select from a highly variegated population 
those cells that contain genes that code on expression for proteins that bind sequence-specifically to predetermined 
DNA sequences. The genetic constructions employed can also be used as an assay for putative DBPs that are obtained 
in other ways. 


BRIEF DESCRIPTION OF THE DRAWINGS 
[0056] 

Figure 1 Schematic of protein bound to DNA.
Figure 2 Schematic of evolution of a binding protein.
Figure 3 Plasmid pKK175-6.
Figure 4 Plasmid pAA3H.
Figure 5 Summary of construction of pEP1009.
Figure 6 Plasmid pEP1001.
Figure 7 Plasmid pEP1002.
Figure 8 Plasmid pEP1003.
Figure 9 Plasmid pEP1004.
Figure 10 Plasmid pEP1005.
Figure 11 Plasmid pEP1007.
Figure 12 Plasmid pEP1009.

DETAILED DESCRIPTION OF THE INVENTION AND ITS PREFERRED EMBODIMENTS

Abbreviations:

[0057] The following abbreviations will be used throughout the present invention:

Abbreviation    Meaning
DBP             DNA-binding protein
idbp            A gene encoding the initial DBP
pdbp            A gene encoding a potential-DBP
vgDNA           variegated DNA
dsDNA           double-stranded DNA
ssDNA           single-stranded DNA
TcR, TcS        Tetracycline resistance or sensitivity
GalR, GalS      Galactose resistance or sensitivity
Gal+, Gal-      Ability or inability to utilize galactose
FusR, FusS      Fusaric acid resistance or sensitivity
KmR, KmS        Kanamycin resistance or sensitivity
ApR, ApS        Ampicillin resistance or sensitivity

Terminology 

[0058] A domain of a protein that is required for the protein to specifically bind a chosen DNA target subsequence
is referred to herein as a "DNA-binding domain". A protein may comprise one or more domains, each composed of one
or more polypeptide chains. A protein that binds a DNA sequence specifically is denoted as a "DNA-binding protein".
In one embodiment of the present invention, a preliminary operation is performed to obtain a stable protein, denoted as
an "initial DBP", that binds one specific DNA sequence. The present invention is concerned with the expression of
numerous, diverse, variant "potential-DBPs", all related to a "parental potential-DBP" such as a known DNA-binding
protein, and with selection and amplification of the genes encoding the most successful mutant potential-DBPs. An initial
DBP is chosen as parental potential-DBP for the first round of variegation. Selection isolates one or more "successful
DBPs". A successful DBP from one round of variegation and selection is chosen to be the parental DBP to the next
round. The invention is not, however, limited to proteins with a single DNA-binding domain since the method may be
applied to any or all of the DNA-binding domains of the protein, sequentially or simultaneously.
[0059] Amino acids are indicated by the single-letter code (AUSU87, Appendix A).

[0060] Symbols that represent ambiguous DNA are: T, C, A, G for themselves; M for A or C; R for A or G; W for A
or T; S for C or G; Y for T or C; K for G or T; V for A, C, or G; H for A, C, or T; D for A, G, or T; B for C, G, or T; N for any
base.
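
As an illustration of how these ambiguity symbols define a population of sequences, the following sketch expands an ambiguous pattern into every unambiguous sequence it encodes and counts them. The function names are illustrative only; the symbol table is the one listed in the paragraph above.

```python
from itertools import product

# Ambiguity symbols as defined in the paragraph above.
AMBIGUITY = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT", "N": "ACGT",
}

def expand(pattern):
    """List every unambiguous DNA sequence encoded by an ambiguous pattern."""
    choices = [AMBIGUITY[symbol] for symbol in pattern.upper()]
    return ["".join(bases) for bases in product(*choices)]

def count_variants(pattern):
    """Number of distinct sequences encoded, without enumerating them."""
    total = 1
    for symbol in pattern.upper():
        total *= len(AMBIGUITY[symbol])
    return total
```

For instance, `expand("CCWGG")` yields the two sequences CCAGG and CCTGG, and `count_variants("NNS")` shows that a codon written NNS covers 32 distinct codons.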

[0061] Conventionally, DNA sequences are written from 5' to 3', left-to-right.

    anti-sense DNA:  5' ATG CTT TTC ... 3'
    sense DNA:       3' TAC GAA AAG ... 5'

    mRNA:            5' AUG CUU UUC ... 3'
    protein:            M - L - F -

We will use the convention that the "sense" strand is the strand used as template for mRNA synthesis.
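
The strand convention can be checked mechanically. The sketch below derives the template ("sense", in this document's usage) strand from the strand written 5' to 3', transcribes it, and translates the result. The codon table is deliberately partial and covers only the three codons of the example; the function names are illustrative.

```python
DNA_COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}
RNA_COMPLEMENT = {"A": "U", "T": "A", "G": "C", "C": "G"}

# Partial codon table: only the codons used in the example above.
CODONS = {"AUG": "M", "CUU": "L", "UUC": "F"}

def template_strand(antisense_5to3):
    """Base-by-base complement, read 3' to 5' (the template for mRNA synthesis)."""
    return "".join(DNA_COMPLEMENT[b] for b in antisense_5to3)

def transcribe(template_3to5):
    """mRNA (5' to 3') complementary to a template strand given 3' to 5'."""
    return "".join(RNA_COMPLEMENT[b] for b in template_3to5)

def translate(mrna):
    """Translate complete codons using the partial table above."""
    return "".join(CODONS[mrna[i:i + 3]] for i in range(0, len(mrna) - 2, 3))
```

Applied to ATGCTTTTC, the template strand is TACGAAAAG (3' to 5'), the mRNA is AUGCUUUUC, and the protein begins M-L-F.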
[0062] In the present invention, the words "grow", "growth", "culture", and "amplification" mean increase in number,
not increase in size of individual cells. In the present invention, the words "select" and "selection" are used in the genetic
sense; i.e., a biological process whereby a phenotypic characteristic is used to enrich a population for those organisms
displaying the desired phenotype.

[0063] One selection is called a "selection step"; one pass of variegation followed by as many selection steps as 
are needed to isolate a successful DBP, is called a "variegation step". The amino acid sequence of one successful DBP
from one round becomes the parental potential-DBP to the next variegation step. We perform variegation steps itera- 
tively until the desired affinity and specificity of DNA-binding between a successful DBP and chosen target DNA 
sequence are achieved. 

[0064] In a "forward selection" step, we select for the binding of the PDBP to a target DNA sequence; in a "reverse
selection" step, for failure to bind. The target DNA sequence may be the final target sequence of interest, or the immediate
target may be a related sequence of DNA (e.g., a "left symmetrized target" or "right symmetrized target"). There
is an important distinction between screening and selection. Screening merely reveals which cells express or contain 
the desired gene. Selection allows desired cells to grow under conditions in which there is little or no growth of unde- 
sired cells (and preferably eliminates undesired cells). 

[0065] The term "operon" is used to mean a collection of one or more genes that are transcribed together. We will 
use operon to refer also to one or more genes that are transcribed together in eukaryotic cells independent of post-tran- 
scriptional processing. 

[0066] The term "binding marker gene" is used to mean those genes engineered to detect sequence-specific DNA
binding as by association of a target DNA with a structural gene and expression control sequences. A single operon
may include more than one binding marker gene (e.g., galT,K). A "control marker gene" is one whose expression is not
affected by the specific binding of a protein to the target DNA sequence. The "control promoter" is the promoter opera- 
bly linked to the control marker gene. 

[0067] Palindrome, palindromic, and palindromically are used to refer to DNA sequences that are the same when
read along either strand:

                Palindromic DNA
                Rotational axis
                     ↓
    5' C T A G C C T . A G G C T A G 3'
    3' G A T C G G A . T C C G A T C 5'

The arrow indicates the center of the palindrome; if the sequence is rotated 180° about the central dot, it appears
unchanged. In the present application, "palindromic" does not apply to sequences that have mirror symmetry within one
strand, such as

                  Mirror Plane
                     |
    5' C T A G C C T | T C C G A T C 3'
    3' G A T C G G A | A G G C T A G 5'
                     |

DNA sequences can be partially palindromic about some point (that can be either between two base pairs or at one
base pair) in which case some bases appear unchanged by a 180° rotation while other bases are changed.
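
The distinction between a palindrome (rotational symmetry across both strands) and mirror symmetry within one strand can be made concrete with a short sketch. It treats a sequence as palindromic when it equals its own reverse complement; the function names are illustrative.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def reverse_complement(seq):
    """Return the opposite strand, also written 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))

def is_palindromic(seq):
    """True if the duplex is unchanged by a 180-degree rotation."""
    return seq.upper() == reverse_complement(seq)

def has_mirror_symmetry(seq):
    """True if one strand reads the same forwards and backwards."""
    return seq.upper() == seq.upper()[::-1]

def palindromic_positions(seq):
    """Count positions left unchanged by rotation (useful for partial palindromes)."""
    return sum(a == b for a, b in zip(seq.upper(), reverse_complement(seq)))
```

With the two sequences shown above, CTAGCCTAGGCTAG is palindromic but not mirror-symmetric, while CTAGCCTTCCGATC is mirror-symmetric but not palindromic.
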
[0068] A special case of partially palindromic sequence is a "gapped palindrome" in which palindromically related
bases are separated by one or more bases that lack such symmetry:

                  Gapped Palindrome
        1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
    5'  C  T  A  G  C  T  T  T  C  C  G  G  C  T  A  G  3'
    3'  G  A  T  C  G  A  A  A  G  G  C  C  G  A  T  C  5'

has CTAGC (bases 1-5) palindromically related to GCTAG (bases 12-16) while the sequence TTTCCG (bases 6-11) in
the center has no symmetry.
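
A gapped palindrome can be split into its two palindromically related arms and the unsymmetric gap. The sketch below grows the arms inward from both ends for as long as the bases remain related by the rotation; it assumes the arms begin at the ends of the supplied sequence, and the function name is illustrative.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def split_gapped_palindrome(seq):
    """Return (left_arm, gap, right_arm) for a sequence given 5' to 3'.

    Base i (from the left) is palindromically related to base i (from the
    right) when one is the complement of the other.
    """
    seq = seq.upper()
    n = len(seq)
    k = 0
    while k < n // 2 and seq[k] == COMPLEMENT[seq[n - 1 - k]]:
        k += 1
    return seq[:k], seq[k:n - k], seq[n - k:]
```

For the 16-base example above, `split_gapped_palindrome("CTAGCTTTCCGGCTAG")` separates the arms CTAGC and GCTAG from the central TTTCCG.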

[0069] For the purposes of this invention, a "non-deleterious cloning site" is a region on a plasmid or phage that can 
be cut with one restriction enzyme or with a combination of restriction enzymes so that a large linear molecule compris- 
ing all essential elements can be recovered. 

Overview: Standard Methods 

[0070] Bacterial strains are cultured by standard methods (DAVI80, MILL72, AUSU87). Constructions of vectors
are by standard methods (MANI82, ZOLL84, AUSU87). All genetic constructions are confirmed, first by analysis with
restriction enzymes, and then by sequencing. Sequencing is by the Sanger dideoxy method or by the Maxam-Gilbert
chemical method. Constructions that confer a phenotype are tested for display of the desired phenotype. These necessary
controls are not described repeatedly.

Overview: The Selection System 

[0071] The present invention separates mutated genes that specify novel proteins with desirable sequence-specific
DNA-binding properties from closely related genes that specify proteins with no or undesirable DNA-binding properties,
by: 1) arranging that the product of each mutated gene be expressed in the cytoplasm of a cell carrying a chosen DNA
target subsequence, and 2) using genetic selections incorporating this chosen DNA target subsequence to enrich the
population of cells for those cells containing genes specifying proteins with improved binding to the chosen target DNA
sequence.

[0072] A selectably deleterious gene is positioned relative to, usually downstream from, the target sequence so that
the gene is not expressed if a successful DNA-binding protein specific to this target is expressed in the cell and binds
the target sequence. Alternatively, a selectable beneficial gene may be arranged so that its transcription is occluded by
a strong promoter (ADHY82, ELLE89a, ELLE89b). The target sequence is placed in or near the occluding promoter so
that successful binding by a protein will repress the occluding promoter and allow transcription of the beneficial gene.
Elledge and coworkers disclose that such systems work best in the bacterial chromosome or on low-copy-number plasmids.
The cell will survive exposure to the selective conditions if transcription of the selectably deleterious genetic element
is blocked.

[0073] The preferred cell line or strain is easily cultured, has a short doubling time, has a large collection of well 
characterized selectable genes, includes variants that are deficient in genetic recombination, and has a well developed 
transformation system that can easily produce at least 10^ independent transformants/ug of DNA. Bacterial cells are 
preferred over yeasts, fungi, plant, or animal cells because they are superior on every count. Among bacteria, E. coli is
the premier candidate because of the wealth of knowledge of genetics and cellular processes. Other bacterial strains,
such as S. typhimurium, Pseudomonas aeruginosa, Klebsiella aerogenes, Bacillus subtilis, or Streptomyces coelicolor,
could be used. DBPs that bind to host regulatory sequences, such as promoters, will be toxic. Thus, development of a
DBP that specifically binds to E. coli promoters is preferably done in a cell line or strain, such as S. coelicolor, having
significantly different promoter sequences.

[0074] In the most preferred embodiments, all novel DBPs are developed in E. coli recA- strains. The recA- genotype
is preferred over other rec- mutations because the recA- mutation reduces the frequency of recombination more than
other known rec- mutations and the recA- mutation has fewer undesirable side effects. We choose a host strain that
methylates or does not methylate the target sequence in the desired way. For example, a Dcm- strain is appropriate if
the target sequence contains CCWGG and we want a DBP that binds the unmethylated form.
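
The host-strain choice just described can be sketched as a simple check for Dcm recognition sites (CCWGG, where W is A or T) in the target. The `preferred_host` helper and its return strings are hypothetical conveniences for illustration; in particular, recommending a Dcm+ host when the methylated form is wanted is an inference from "methylates or does not methylate the target sequence in the desired way".

```python
import re

# Lookahead so that overlapping sites, if any, are all reported.
DCM_SITE = re.compile(r"(?=CC[AT]GG)")

def dcm_sites(target):
    """Return 0-based start positions of CCWGG sites in the target sequence."""
    return [m.start() for m in DCM_SITE.finditer(target.upper())]

def preferred_host(target, want_methylated):
    """Suggest a host Dcm genotype for developing a DBP against this target."""
    if not dcm_sites(target):
        return "Dcm+ or Dcm-"  # no site present, so methylation is irrelevant
    return "Dcm+" if want_methylated else "Dcm-"
```

For example, a target containing CCAGG and intended to be bound in its unmethylated form gives `preferred_host(target, want_methylated=False) == "Dcm-"`.
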
[0075] As vectors, phage, such as M13, have the advantage of a high infectivity rate. Organisms or phage having
a phase in their life cycle in which the genome is single-stranded DNA have a higher mutation rate than organisms or
phage that have no phase in which the genome is single-stranded DNA. Plasmids are, however, preferred because
genes on plasmids are much more easily constructed and altered than are genes in the bacterial chromosome and are
more stable than genes borne on phage, such as M13. M13-derived vectors are nearly as preferred as plasmids.
[0076] In some embodiments, the cloning vector will carry: a) the selectable genes for successful DBP isolation, b) 
the pdbp gene, c) a plasmid origin of replication, and d) an antibiotic resistance gene not present in the recipient cell to
allow selection for uptake of plasmid. Preferably the operative vector is of minimum size. 

[0077] Alternatively, the selectable binding marker genetic elements are placed on a vector different from but compatible
with the vector that carries the pdbp gene. This arrangement has the advantages that engineering the pdbp
gene is easier on a smaller plasmid and manipulation of pdbp cannot introduce mutations into the selectable binding
marker genes.

[0078] Standard selections for plasmid uptake and maintenance in E. coli include use of antibiotics (e.g., ampicillin
(Ap)) as shown in Table 2. Selection of cells with antibiotics is preferred to nutritional selections, e.g., TrpA+, for several
reasons. Nutritional selection may be overcome by large volumes of cells or growth medium; host chromosomal auxo- 
trophy is rarely total; crossfeeding of the non-growing cells by prototrophic recipients obscures the outlines of the colo- 
nies; and late mutations to prototrophy may arise on the plate due to spontaneous mutation of nongrowing cells. 
Nonetheless, nutritional selection may be employed. 

[0079] Similarly, plasmids for use in B. subtilis are engineered for selection of uptake and maintenance using antibiotics.
Plasmids used in streptomycete species bear genes for resistance to antibiotics such as thiostrepton, neomycin,
and methylenomycin, in preference to auxotrophic markers or sporulation and pigment screens such as spo in
bacilli and mel in streptomycetes.

[0080] Recombinant DNA manipulations in yeasts have been achieved using complementation of auxotrophic 
markers some of which are shown in Table 3. High backgrounds are surmounted by use of two unrelated binding 
marker genes carried on the same vector, e.g., Leu2+ and Ura3+. Selection for G418 resistance conferred by the bacterial
aphII gene expressed in yeast offers the advantages of reduced background and a wider range of appropriate
recipient strains. The current upper range of efficiency of DNA uptake into yeast cells indicates that this organism is not 
now preferred for the process described in this patent, although results could be achieved by large scale practice. 
[0081] The selection systems must be so structured that other mechanisms for loss of gene expression are much
less likely than the desired result, repression at the target DNA subsequence. Other mechanisms that could yield the
desired phenotype include: point mutations that inactivate the deleterious gene or genes, deletion of the deleterious
gene or genes, host mutations that suppress the deleterious genes, and repression at a site other than the target DNA
sequence.

[0082] A wide range of selectable phenotypes for E. coli and S. typhimurium has been described (VINO87). Two
broad classes of selections are useful in this invention, nutritional and chemical. Such selections are inherently conditional
in that they employ addition of a growth-inhibitory chemical to the selective medium, or manipulation of the nutrient
components of the selective medium. Further conditionality of the preferred method is imposed by transcriptional
regulation (e.g., by IPTG in combination with the lacUV5 promoter and the LacIq repressor) of the variegated pdbp gene.
In those members of the population that express DBPs that bind to the target, IPTG indirectly controls the selectable
genes; in these cells, increased IPTG leads to reduced expression of the selectable genes. Therefore the phenotypes
for selection are distinguished only in the presence of an inducing chemical, and potential deleterious effects of these
phenotypes are avoided during storage and routine handling of the strains.

[0083] Selection of mutant strains capable of producing proteins that can bind to the target DNA subsequence is
enabled by engineering conditional lethal genes or growth-inhibiting genes located downstream from the promoter that
contains the target DNA subsequence. In the preferred embodiment, at least two independent conditional lethal or
inhibitory selections are performed simultaneously. It is possible to use a single selection to achieve the same purpose,
but this is not preferred. Two selections are strongly preferred since a simple mutation in the selected gene, occurring
at a frequency of 10^-6 to 10^-8/cell, would occur in two selected genes simultaneously at the product of the individual
frequencies, 10^-12 to 10^-16. Thus use of two selections substantially reduces the probability of isolation of artifactual
revertant or suppressor strains.
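
The argument for two simultaneous selections is a product of independent probabilities, which the following sketch makes explicit. The per-gene frequencies and population size used in the usage note are illustrative inputs chosen for the example, not measurements, and independence of the two escape mutations is an assumption of the calculation.

```python
import math

def escape_frequency(per_gene_frequencies):
    """Frequency per cell of escaping every selection at once.

    Assumes the inactivating mutations in the different selected genes
    arise independently, so the individual frequencies multiply.
    """
    return math.prod(per_gene_frequencies)

def expected_escapes(n_cells, per_gene_frequencies):
    """Expected number of artifactual survivors in a population of n_cells."""
    return n_cells * escape_frequency(per_gene_frequencies)
```

With a hypothetical population of 1e9 cells and a per-gene frequency of 1e-6, one selection is expected to pass about 1000 artifactual survivors, whereas two independent selections are expected to pass about 0.001.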

[0084] Selectable genes for which both forward and reverse selections exist are preferred because, by changing
host or media, we can use these genes to select for binding by a DBP to a target DNA sequence such that expression
of one of these genes is repressed, or we can select phenotypes characteristic of cells in which there is no binding of
the DBP. For example, expression of the tet gene is essential in the presence of tetracycline. On the other hand, expression
of the tet gene is lethal in the presence of fusaric acid. Expression of the galT and galK genes in a GalE- host in
the presence of galactose is lethal (NIKA61). Expression of galT and galK in a host that is GalE+ and either GalT- or
GalK- renders the cells Gal+ and allows them to grow on galactose as sole carbon source.
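
The forward and reverse selections named in this paragraph can be summarized as a small truth table. The sketch below is a simplified model, not a description of the actual plasmids: it assumes that a bound DBP fully represses the marker gene, and that in the GalE+ host galactose is the sole carbon source. Function names and the host labels are illustrative.

```python
def marker_expressed(dbp_binds_target):
    """Simplifying assumption: binding at the target fully represses the marker."""
    return not dbp_binds_target

def survives_tet_selection(tet_expressed, agent):
    """tet expression is essential with tetracycline and lethal with fusaric acid."""
    if agent == "tetracycline":
        return tet_expressed
    if agent == "fusaric acid":
        return not tet_expressed
    raise ValueError(f"unknown agent: {agent}")

def survives_gal_selection(gal_expressed, host):
    """Survival on galactose-containing medium for the two hosts in the text.

    'GalE-': galT/galK expression is lethal in the presence of galactose.
    'GalE+ GalT-/GalK-': expression is required to grow on galactose as
    sole carbon source (assumed to be the only carbon source supplied).
    """
    if host == "GalE-":
        return not gal_expressed
    if host == "GalE+ GalT-/GalK-":
        return gal_expressed
    raise ValueError(f"unknown host: {host}")
```

A cell whose DBP binds the target therefore survives fusaric acid and galactose in a GalE- host (forward selection), while a cell with no binding survives tetracycline and grows on galactose in the GalE+ host (reverse selection).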

[0085] The term "source of a selective agent" includes the selective agent itself and any media components which 
cause the cell to manufacture the selective agent.

[0086] The Detailed Examples describe selection of strains with successful DBP binding to novel target subsequences
due to turn off of two genes, each of which, if expressed, confers sensitivity to a toxic substance. It is also possible
to use selection of strains in which successful DBP binding to novel target operators turns off repressors of genes
encoding required gene products. For example, using the binding marker gene P22 arc, we place an Arc operator site
so that binding of Arc represses expression of a beneficial or conditionally essential gene, such as amp. Another alternative
is selection of expression of required gene products due to successful binding of DBPs derived from positive
effectors, e.g., CAP from E. coli, the repressor from phage λ, or the Cro67 (BUSH88) mutant of λ Cro.
Another alternative is to place the target sequence in or near a strong promoter that occludes transcription of a conditionally
essential gene (ELLE89a, ELLE89b).

[0087] The selections described in the Detailed Examples employ commercially available cloned genes on plasmids
in strains that can be obtained from the ATCC (Rockville, MD). Alternatively, the genes can be produced synthetically
from published sequences or isolated from a suitable genomic or cDNA library.

[0088] Numerous types of selections are possible for selection of DBP expression in E. coli. The toxic and inhibitory
agents listed in Table 4 are used with appropriately engineered host strains and vectors to select loss of gene function
listed above. Repression of transcription of these genes allows growth in the presence of the agents. Other outcomes
such as deletions or point mutations in these genes may also be selected with these agents, hence two functionally
unrelated selections are used in combination. These agents share the property that cell metabolism is stopped, and
unlike the nutritional selections, the inhibitory agents are not overcome by components of the growth medium or turnover
of macromolecules in the cells. Selections using antibiotics, metabolite analogs, or inhibitors are preferred. Another
class of selections includes those for repression of phage or colicin receptors, or for repression of phage promoters.
These agents kill by single-hit kinetics and, in the case of phage, are self-replicating, making the multiplicity of agent to
putative repressed cell much more difficult to control, and so are not preferred (BENS86).

[0089] Any selection system relevant to the cell line or strain may be substituted for those in the examples given
here, with appropriate changes in the engineering of the cloning vectors. One example is the dominant pheS+ gene carried
on plasmid pHE3 (ATCC #37161) in a pheS12 background. Turn-off of pheS+ is selected with p-fluorophenylalanine
(Sigma Corp., St. Louis, MO).

[0090] We could choose the Streptomyces coelicolor cloned glucose kinase gene for selection of the DBP+ phenotype,
using the metabolite analog deoxyglucose.

[0091] Each batch of antibiotic is checked for MIC (minimum inhibitory concentration) under the condition of use.
Increased concentration of antibiotic may be used to increase the stringency of the selection, in most cases.

[0092] The user varies the medium formulation (pH, cation concentrations, buffering agent, etc.) for a particular
selection if the results are not optimal with the strain at hand. For example, Maloy and Nunn (MALO81) describe a
medium yielding improved selection of FusR E. coli colonies from a TcR background, compared to the medium
employed by Bochner (BOCH80) for this purpose using S. typhimurium.

[0093] Stringency of selection can be modulated by controlling copy number of plasmids bearing the selectable
genes; increasing copy number of selectable genes increases the stringency of the selection. 
[0094] During the initial phases of the progressive development of DBP molecules, it is desirable to produce a high
intracellular concentration of DBP. The stringency of the selection is increased in subsequent phases of successful DBP
development by allowing fewer molecules of DBP per cell. Thus it is preferred to regulate transcription of pdbp by an
inducible or derepressible promoter, such as PlacUV5.

[0095] High total cell input often decreases stringency of selections, by providing metabolites that are specifically
omitted, by mass action with respect to an inhibitory agent, or by generating a large number of artificial satellite colonies 
that follow the appearance of genetically resistant colonies. The number of cells that are successfully transformed is a 
function of efficiency of ligation and transformation processes, both of which are optimized in the embodiment of this 
invention. Procedures for maximal transformation and ligation efficiency are from Hanahan (HANA85) and Legerski and 
Robberson (LEGE85) respectively. Increasing stringency is imposed under the conditions of high efficiency of these 
processes by inoculation of plates with small volumes or dilutions of cell samples. Pilot experiments are performed to 
determine optimum dilution and volume. 

[0096] In Detailed Example 1, the transformation event is followed by dilution and growth of cells in permissive
medium. Exogenous inducer of DBP expression is included at this step, and a set of selections
is then imposed in liquid medium. Surviving cells are concentrated by centrifugation, and selected for these and additional
traits using solid medium in Petri plates. This protocol offers the advantage that fewer identical siblings are
obtained and a larger population is easily screened. In Detailed Example 1, repression of the GalS phenotype is
selected by exposing transformants to galactose in liquid medium, which produces visible lysis of galactose-sensitive
cells. The second selection employed in Detailed Example 1 is for the FusR phenotype due to repression of TcR, which
requires limitation of total inoculum size to 10^ cells/plate. Similar protocol variations are introduced to combine selections
for transformation and successful DBP function.

[0097] Tests of selective agents to determine the conditions that kill or inhibit sensitive cells are performed with pure 
cultures of sensitive cells. These include strains carrying the selective marker genes having the recognition sequence
of the IDBP as target, with and without idbp, and with and without the inducer of idbp expression.
[0098] Cultures of sensitive cells are applied to selective media as inocula appropriate to the selection (usually 10^ 
to 10^ per plate). Sufficient numbers of replicates (10^ to 10^ total sensitive cells for each medium) are tested by each 
selection. The rate at which the cultures produce revertants and phenotypic suppressors (considered together as revertants)
is determined. A rate greater than 10"® per cell indicates that stringency must be increased. If reversion rates are
below this level, as we have shown for the selections described in Example 1, mixing experiments are performed to 
determine the sensitivity of recovery of a small fraction of resistant cells from a vast excess of sensitive cells. 
[0099] Normally, the deleterious gene product of a binding marker gene is a protein. It may also be an RNA, e.g., 
an mRNA which is antisense to the mRNA of an essential gene and therefore blocks translation of the latter mRNA into 
protein. Another alternative is that transcription of the binding marker gene may be deleterious because this transcription
occludes transcription of an adjacent beneficial gene. Selectively deleterious genes suitable for use in the present
invention include those shown in Table 4.

[0100] The two selectably deleterious genes are preferably not functionally related. For example, the chosen genes
should not code for proteins localized to or affecting the same macromolecular assembly in the cell or which alter the
same or intersecting anabolic or catabolic pathways. Thus, use of two inhibitors that select for mutations affecting RNA
synthesis, aromatic amino acid synthesis, or each of histidine and purine synthesis is not preferred. Similarly, two
inhibitors that are transported into the cell by shared membrane components are functionally related, and are not
preferred. In this manner the user reduces the frequency of isolation of single host mutations that yield the apparent
desired phenotype because of suppression of the shared functionality, interacting component, or precursor relationship.
Host mutations of this type are conveniently distinguished by a screen of the selectable phenotypes in the absence
of the inducer of the DBP, e.g., IPTG.

[0101] Examples of pairs of deleterious genes which are recommended for use in the present invention are given
in Table 5A. In each case, one of the paired genes codes for a product that acts intracellularly while the other codes for
a product that acts either in transport into or out of the cell or acts in an unrelated biological pathway. Table 5B gives
some pairs that are not recommended. These pairs have not been shown to malfunction, but they are not recom- 
mended, given the large number of choices that are clearly functionally unrelated. 

[0102] A preferred novel feature is the use of a copy of the promoter of one of these beneficial or conditionally
essential genes, operably linked to the target DNA subsequence, to direct transcription of the selectably deleterious or
conditionally lethal binding marker genes of the plasmid. If the potential-DBP should repress the selectable gene by
binding to this promoter, it would also repress this beneficial activity.
[0103] In order to assure that selection for DBP binding is specific to the target and not the promoter, we preferably
place one of the two selectable binding marker genes under the same transcription initiation signal as the gene we use
for selection of vector maintenance. In Detailed Example 1, transcription of the galT and galK genes is initiated by the
Pamp promoter, as is the amp gene.

[0104] It is possible that the potential-DBP will bind specifically to the boundary between the target DNA sequence
and the promoter, or within the structural gene. In the preferred embodiment, we discriminate against this mechanism
by choosing a different promoter, operably linked to another copy of the same target DNA sequence, for the second
selectable gene. Preferably the two promoters that initiate transcription of the selectable genes should be strong
enough to give a sensitive selection, but not so strong that they cannot be repressed by binding of a novel DBP. Some
well studied promoters and their scores by the Mulligan algorithm (MULL84) are shown in Table 6. Promoters that score
between 50% and 70% are good candidates for use in binding marker genes. Preferably, the two promoters have
significant sequence differences, particularly in the region of the junction to the target DNA sequence. Specifically, the
region between the -10 region and the target sequence, which comprises five to seven bases, should have no more than
two identical bases in the two promoters. Although the -10 regions of promoters show high homology, promoters are
known (e.g., Pamp having GACAAT and Pneo having TAAGGT) that have as few as two out of six bases identical in this
region, and such difference is preferred.
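
The comparison of two promoter regions described above amounts to counting positions at which both carry the same base. The following is a minimal sketch of that count, applied to the two -10 hexamers quoted in the text (GACAAT and TAAGGT). The function name and the restriction to equal-length strings are assumptions made for illustration; they are not part of the described method.

```python
def identical_positions(a: str, b: str) -> int:
    """Count aligned positions at which two equal-length DNA strings share a base."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x == y for x, y in zip(a.upper(), b.upper()))


def sufficiently_different(a: str, b: str, max_identical: int = 2) -> bool:
    """True if the two regions share no more than max_identical aligned bases."""
    return identical_positions(a, b) <= max_identical


if __name__ == "__main__":
    # The two -10 hexamers quoted in the text.
    first, second = "GACAAT", "TAAGGT"
    print(identical_positions(first, second))     # 2 (positions 2 and 6 match)
    print(sufficiently_different(first, second))  # True
```

The same check could be applied to the five to seven bases between the -10 region and the target, provided the two regions are first aligned to a common length.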

[0105] The target DNA sequence for the potential DNA-binding protein must be associated with the two deleterious
binding marker genes and their promoters so that expression of the binding marker genes is blocked if a novel protein
in fact binds to the target sequence. The target DNA sequence could appear upstream of the gene, downstream of the
gene, or, in certain hosts, in a noncoding region (viz. an "intron") within the gene. Preferably, it is placed upstream of the
coding region of the gene, that is, in or near the RNA polymerase binding site for the gene, i.e., the promoter. If the
binding marker gene is an occluding promoter, the target is, preferably, placed downstream of the promoter. Placement
of the target DNA sequence relative to the promoter is influenced by two main considerations: a) protein binding should
have a strong effect on transcription so that the selection is sensitive, b) the activity of the promoter in the absence of a
binding protein should be relatively unaffected by the presence of the test DNA sequence compared to any other target
subsequence.

[0106] In the present invention, we will deal primarily with DNA target subsequences of 10 to 25 bases. It has been
noted that the highly conserved -35 region and the highly conserved -10 region are separated by between 15 and 21
base pairs with a mode of 17 base pairs (HAWL83, MULL84). Some of the bases between -35 and -10 are statistically
non-random; thus placement of target DNA sequences longer than 10 bases between the -10 and -35 regions would
likely affect the promoter activity independent of binding by potential-DBPs. Because quantitative relationships between
promoter sequence and promoter strength are not well understood, it is preferable, at present, to use known promoters
and to position the target at the edge of the RNA polymerase binding site.

[0107] Protein binding to DNA has maximum effect on transcription if the binding site is in or just downstream from
the promoter of a gene. Hoopes and McClure (HOOP87) have reviewed the regulation of transcription initiation and
report that the LexA binding site can produce effective repression in a variety of locations in the promoter region. In a
preferred embodiment, we place target DNA sequences that begin with A or G so that the first 5' base of the target
sequence is the +1 base of the mRNA, as the LexA binding site is located in the uvrD gene (HOOP87, p1235). If the
target sequence begins with C or T, we preferably place the target so that the first 5' base of the target is the +2 base
of the mRNA and we place an A or G at the +1 position. An alternative is to place the target DNA sequences upstream
of the -35 region, as the LexA binding site is located in the ssb gene (HOOP87, p1235).
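
The placement rule for the preferred embodiment can be written as a short function. This is only a sketch of the rule as stated: a target beginning with a purine starts at +1, otherwise a purine is placed at +1 and the target starts at +2. The function name, the default choice of A as the filler base, and the return format are assumptions introduced here for illustration.

```python
def place_target_at_start_site(target: str, filler: str = "A"):
    """Return (transcribed_sequence, position_of_first_target_base).

    Positions are numbered from +1, the first base of the mRNA.
    `filler` is the purine placed at +1 when the target begins with C or T;
    choosing A rather than G by default is an arbitrary assumption.
    """
    target = target.upper()
    if not target or set(target) - set("ACGT"):
        raise ValueError("target must be a non-empty string over A, C, G, T")
    if filler not in ("A", "G"):
        raise ValueError("filler must be A or G")
    if target[0] in "AG":
        # Target begins with a purine: its first base is the +1 base.
        return target, 1
    # Target begins with a pyrimidine: put a purine at +1, target starts at +2.
    return filler + target, 2


if __name__ == "__main__":
    print(place_target_at_start_site("GATCCA"))  # ('GATCCA', 1)
    print(place_target_at_start_site("CTAGGA"))  # ('ACTAGGA', 2)
```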

[0108] It may be useful in early stages of the development of a DBP to have more than one copy of the target DNA
sequence positioned so that binding of a DBP reduces transcription of the selectable gene. Multiple copies of the target
DNA sequence enhance the sensitivity of phenotypic characteristics to binding of DBPs to the target DNA sequence.
Multiple copies of the target DNA sequence are, preferably, placed in tandem downstream of the promoter. Alternatively,
one could place one copy upstream of the promoter and one or more copies downstream.

[0109] We arrange the genes on the plasmid or plasmids in such a way that no single deletion event eliminates both
deleterious genes without also eliminating a gene essential either to plasmid replication or cell survival. Thus, resistant
colonies are unlikely to arise through deletions because two independent deletion events are required. Similarly,
simultaneous occurrence of one point mutation and one deletion is as unlikely as two point mutations or two deletions.

[0110] A typical arrangement of genes on the operative cloning vector, similar to that used in Detailed Example 1, is:

Px represents the promoter that initiates transcription of the amp gene. A second copy of Px initiates transcription of
galT,K. Py is a promoter driving tet, t is a transcriptional terminator (different terminators may be used for different
genes), and $ is the target subsequence. PlacUV5 is the lacUV5 promoter, @ represents the lacO operator, and pdbp
is a variegated gene encoding potential DBPs. Placement of the pdbp gene relative to other genes is not important
because mutations or deletions in pdbp cannot cause false positive colony isolates. Indeed, it is not necessary that the
pdbp gene be on the selection vector at all. The purpose of the selection vector is to ensure that the host cell survives
only if one of the potential DBPs binds to the target sequence (forward selection) or fails to so bind (reverse selection).
The pdbp gene may be introduced into the host cell by another vector.

[0111] Two-way selections are available for both tet and galT,K (vide supra). The orientation of each gene in the
selection vector is unimportant because strong terminators (e.g., rrnBt1, rrnBt2, phage fd terminator) are preferably
placed at the ends of each transcription unit. That galT,K and tet are separated by essential genes, however, is of
fundamental importance. The sequence ori is essential for plasmid replication, and the amp gene, the transcription of
which is initiated by Px, is essential in the presence of Ap. Successful repression of galT,K and tet is selected with
galactose and fusaric acid. No single deletion event can remove both the latter genes and allow plasmid maintenance
or cell survival under selection. In addition, binding by a novel DBP to the Px promoter would render the cell Ap sensitive.
These arrangements make appearance of a novel DBP that binds the target DNA more probable than any of the other
modes by which the cells can escape the designed selections.

Overview: Choice of target DNA binding sequence for development of successful novel DBPs:

[0112] Our goal is the development, in part by conscious design and in part by in vivo selection, of a protein which
binds to a DNA sequence of significance, e.g., a structural gene or a regulatory element, and through such binding
inhibits or enhances its biological activity. In the preferred embodiment, the protein represses transcription of a
deleterious element, such as a viral gene. A sufficiently long sequence could be the target of several independently
acting DBPs.

[0113] Another goal of this invention is to derive one or more DBPs that bind sequence-specifically to any
predetermined target DNA subsequence. It is not yet possible to design the DBP-domain amino-acid sequence from a
set of rules appropriate to the target DNA subsequence. Rather, it is possible to pick sets of residues that can affect the
DNA recognition of a parental DBP. Then, variegation of residues that affect DNA recognition coupled with selection for
binding to the target DNA subsequence can produce a novel DBP specific for the target DNA subsequence. Such a
method is limited by the number of amino acids that can be varied at one time. To develop a novel DBP that recognizes
15 bases could require changing 15 or more residues in the initial DBP. Variegation of 15 residues through all 20 amino
acids would produce 20^15 = 3.3 x 10^19 sequences and is beyond current technology. Thus we start with the
recognition sequence of the initial DBP, change two to five bases and select, in one or more rounds of variegation and
selection, a novel DBP that recognizes this new target DNA subsequence. This new DBP becomes the parent to the next
step, in which the target DNA subsequence is changed by an additional two to five bases, so that a stepwise series of
changes in binding protein and changes in target is used. It is emphasized here that, although we initially select DBPs
that recognize sequences similar to that recognized by the IDBP, the ultimate target sequence recognized by the desired
final DBP can be completely unrelated to the recognition sequence of the IDBP.
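
Two pieces of arithmetic underlie this stepwise strategy: the size of a fully variegated library, and the number of intermediate targets needed to walk from the recognition sequence of the initial DBP to the final target. The sketch below illustrates both. The function names are hypothetical, and the left-to-right order in which differing bases are changed is an arbitrary assumption; the text specifies only that two to five bases change per step.

```python
def library_size(n_residues: int, alphabet: int = 20) -> int:
    """Number of distinct sequences when n_residues are varied through `alphabet` amino acids."""
    return alphabet ** n_residues


def stepwise_targets(start: str, final: str, max_changes: int = 5) -> list:
    """List intermediate target sequences leading from `start` to `final`.

    At most `max_changes` bases are altered per step. Differences are
    resolved left to right, which is an assumption made for illustration.
    """
    if len(start) != len(final):
        raise ValueError("start and final targets must have equal length")
    if max_changes < 1:
        raise ValueError("max_changes must be at least 1")
    current = list(start)
    differing = [i for i, (a, b) in enumerate(zip(start, final)) if a != b]
    steps = []
    for k in range(0, len(differing), max_changes):
        for i in differing[k:k + max_changes]:
            current[i] = final[i]
        steps.append("".join(current))
    return steps


if __name__ == "__main__":
    print(f"{library_size(15):.2e}")  # 3.28e+19
    for target in stepwise_targets("AAAAAAAAAA", "CCCCCCCCCC", max_changes=5):
        print(target)
```

Each sequence returned by `stepwise_targets` would serve as the target for one or more rounds of variegation and selection, with the DBP selected against it becoming the parent for the next step.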

[0114] The process of finding a DBP that recognizes a sequence within a genome is shortened if we pick
sequences that have some similarity to the cognate sequence of the initial DBP. The intent is to locate several unique
sites in the gene which can be bound specifically by DBPs such that transcription through those sites is reduced.

[0115] The sequences of some regions of genes of eukaryotic pathogens vary among strains (SAAG88). To optimize
the search for target sites in the gene selected for repression such that repression will be effective in all or the
majority of strains of a pathogen, regions of conserved DNA sequence within the gene are, preferably, identified.

[0116] There may be a very small number of sequences that occur in the genome of the host cells for which binding
of a DBP will be lethal. For this reason, the regulatory sequences, such as promoters, of the host organism are not
preferred targets for DBP development. Preferably, the target sequence occurs only in the gene of interest. For some
applications, target sequences that occur at locations other than the site of intended action may be used if binding of a
protein to the extra sites is acceptable.

[0117] Preliminary elimination of non-unique sequences is done by searching DNA sequence data banks of host 
genomic sequences and bacterial strain sequences; and by searching the plasmid sequences for matches to the poten- 
tial target subsequences. Remaining potential target subsequences are then used as oligonucleotide probes in South- 
ern analyses of host genomic DNA and bacterial DNA. Sequences which do not anneal to host or bacterial DNA under 
stringent conditions are retained as target subsequences. These target subsequences are cloned into the operative 
vector at the promoters of the selection genes for DBP function, as described for the test DNA binding sequence. 
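
The preliminary data-bank search can be pictured as an exact-match scan of each background sequence on both strands. The following is a minimal sketch under that assumption; the function names are hypothetical, and exact matching is a simplification, since hybridization under stringent conditions also tolerates some near matches that this code would not detect.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def count_occurrences(target: str, sequence: str) -> int:
    """Count exact matches of `target` on either strand of `sequence` (overlaps included)."""
    target, sequence = target.upper(), sequence.upper()
    if not target:
        raise ValueError("target must not be empty")
    probes = {target, reverse_complement(target)}
    n = len(target)
    return sum(sequence[i:i + n] in probes for i in range(len(sequence) - n + 1))


def unique_candidates(candidates, background_sequences):
    """Keep candidate targets with no exact match in any background sequence."""
    return [
        t for t in candidates
        if all(count_occurrences(t, s) == 0 for s in background_sequences)
    ]


if __name__ == "__main__":
    background = ["AAGATTCAAGAATCAA"]  # toy stand-in for host, bacterial, and plasmid sequences
    print(count_occurrences("GATTC", background[0]))        # 2 (one match per strand)
    print(unique_candidates(["GATTC", "CCCCC"], background))  # ['CCCCC']
```

Candidates that survive such a scan would still need the Southern analysis described above before being retained.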

[0118] Choice of target subsequences is based also on the optimal location of target sites within a gene such that
transcription will be maximally affected. Studies of monkey L-cells show that lac repressor can bind to lac operator, or
to two lac operators in tandem, in the L-cell nucleus (HUMC87, HUMC88). Further, this binding results in repression of
a downstream chloramphenicol acetyl transferase gene in this system, and repression is relieved by IPTG. Two tandem
operators repress CAT enzyme production to a greater extent than a single operator. The user preferably locates two to
four target sites relatively close to each other within the transcriptional unit.

Overview: Selection of the Initial DNA-Binding Protein for Variegation

[0119] The choice of an initial DBP is determined by the degree of specificity required in the intended use of the
successful DBP and by the availability of known DBPs. The present invention describes three broad alternatives for
producing DBPs having high specificity and tight binding to target DNA sequences. The present invention is not limited
to these classes of initial potential DBPs.

[0120] A first alternative is to use a polypeptide that will conform to the DNA and can wind around the DNA and
contact the edges of the base pairs. A second alternative is to use a globular protein (such as a dimeric H-T-H protein)
that can contact one face of DNA in one or more places to achieve the desired affinity and specificity. A third alternative
is to use a series of flexibly linked small globular domains that can make contact with several successive patches on the
DNA.



DNA features influencing choice of an initial DBP:

[0121] Features of DNA that influence the choice of an initial DBP include sequence-specific DNA structure and the
size of the genome within which the DBP is expected to recognize and affect gene expression.

[0122] Sequence-specific aspects of DNA structure that can influence protein binding include: a) the edges of the
bases exposed in the major groove, b) the edges of the bases exposed in the minor groove, c) the equilibrium positions
of the phosphate and deoxyribose groups, d) the flexibility of the DNA toward deformation, and e) the ability of the DNA
to accept intercalated molecules. Note that the sequence-specific aspects of DNA are carried mostly inside a highly
charged molecular framework that is nearly independent of sequence.

[0123] The strongest signals of sequence are found in the edges of the base pairs in the major groove, followed by
the edges in the minor groove. The groove dimensions depend on local DNA sequence (NEID87b, KOUD87, ULAN87).

[0124] The number of base pairs required to define a unique site depends on the size and non-randomness of the
genome. Consider a genome of length Zg bases and consider a specific subsequence of length Q. If the genome is
random, the subsequence is expected to occur N(Q) times, where

\[ N(Q) = \frac{2 Z_g}{4^{Q}} \]

From this equation, we derive the expression for Q_u, which is the lower limit of the length of subsequences that are
expected to occur once or be absent:

\[ Q_u = \frac{\log(2 Z_g)}{\log 4} \]

[0125] Thus, a DNA subsequence comprising 12 base pairs may be unique in the E. coli genome (5 x 10^6 bp), but
is likely to occur about 180 times in a random sequence the size of the human genome (3 x 10^9 bp).

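
The random-genome estimate can be checked numerically. The sketch below assumes the expectation has the form N(Q) = s Zg / 4^Q, where s is the number of strands counted; the parameter `strands`, the function names, and the genome sizes of 5 x 10^6 bp (E. coli) and 3 x 10^9 bp (human) are assumptions made for this illustration. With s = 1, a 12-base subsequence is expected about 180 times in a human-sized genome, matching the figure quoted in the text; with s = 2 the expectation doubles.

```python
def expected_occurrences(q: int, genome_bp: float, strands: int = 2) -> float:
    """Expected number of times a specific q-base subsequence occurs in a random genome."""
    return strands * genome_bp / 4 ** q


def min_unique_length(genome_bp: float, strands: int = 2) -> int:
    """Smallest integer length whose expected number of occurrences is at most one."""
    q = 1
    while expected_occurrences(q, genome_bp, strands) > 1:
        q += 1
    return q


if __name__ == "__main__":
    e_coli, human = 5e6, 3e9  # assumed genome sizes in base pairs
    print(round(expected_occurrences(12, e_coli), 2))          # 0.6
    print(round(expected_occurrences(12, human, strands=1)))   # 179
    print(round(expected_occurrences(12, human, strands=2)))   # 358
    print(min_unique_length(e_coli), min_unique_length(human)) # 12 17
```

Because real genomes are not random, these values are lower bounds on the length needed for uniqueness rather than guarantees.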
[0126] The non-random nature of DNA sequences in genomes has been shown to result in the over- and
under-representation of specific sequences. The random-genome model can under-estimate the probe length needed to
define a unique coding sequence (LATH85). Recognition sites for certain restriction enzymes occur in clusters and are
found much more often than expected (SMIT87). In contrast, lac repressor binding sites in eukaryotic genomes are
almost two orders of magnitude less frequent than expected on the basis of random sequence (SIMO84).

Protein features influencing choice of initial DBP: 

[0127] Sequence-specific binding to DNA by DBPs does not require unpairing of the bases. Most sequence-specific
binding by proteins to DNA is thought to involve contacts in the DNA major groove.

[0128] To be certain of unique recognition in the human genome, it is best to design a protein that recognizes 19 to
21 base pairs. To contact 20 base pairs directly, a protein would need to: a) wind two full turns around the DNA making
major groove contacts, b) make a combination of major groove and minor groove contacts, or c) contact the major
groove at four or five places. An extended polypeptide, binding in the major groove of B-DNA, lies about 5.0 Å from the
DNA axis. One base pair and 1 1/2 amino acids extend roughly equal distances along the helix (SAEN83, p238).

[0129] A nine residue alpha helix, such as the recognition helices of H-T-H repressors, extends about 13.5 Å along
the major groove. If residues with long side chains are located at each terminus of the helix, the helix can make contacts
over a 20.0 Å stretch of the major groove, allowing six base pairs to be contacted. Parts of the DBP other than the
second helix of the H-T-H motif can make additional protein-DNA contacts, adding to specificity and affinity. The rigidity
of the alpha helix prevents a long helix from following the major groove around the DNA. A series of small domains,
appropriately linked, could wind around DNA, as has been suggested for the zinc-finger proteins (BERG88a, GIBS88,
FRAN88). In an extended configuration a polypeptide chain progresses roughly 3.2 to 3.5 Å between consecutive
residues. Thus, a 10 residue extended protein structure could contact 5 to 8 bases of DNA.
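
The lengths quoted above follow from per-residue distances. The sketch below reproduces the arithmetic, assuming the standard alpha-helical rise of 1.5 Å per residue (an assumption consistent with the 13.5 Å quoted for a nine residue helix) and the 3.2 to 3.5 Å per residue given for an extended chain. The constant and function names are introduced here for illustration only, and no conversion from chain length to number of bases contacted is attempted.

```python
HELIX_RISE_ANGSTROM = 1.5  # rise per residue along an alpha helix (assumed standard value)


def helix_length(n_residues: int) -> float:
    """Approximate length in angstroms spanned by an alpha helix of n_residues."""
    return n_residues * HELIX_RISE_ANGSTROM


def extended_length_range(n_residues: int, low: float = 3.2, high: float = 3.5):
    """Approximate (min, max) length in angstroms of an extended chain of n_residues."""
    return n_residues * low, n_residues * high


if __name__ == "__main__":
    print(helix_length(9))            # 13.5
    print(extended_length_range(10))  # (32.0, 35.0)
```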

[0130] Stable complexes of proteins with other macromolecules involve burial of 1000 to 3000 Å^2 of surface area
on each molecule. For a globular protein to make a stable complex with DNA, the protein must have substantial surface
that is already complementary to the DNA surface or can be deformed to fit the surface without loss of much free
energy. Considering these modalities, we assign each genetically encoded polypeptide to one of three classes:

1) a polypeptide that can easily deform to complement the shape of DNA,

2) a globular protein, the internal structure of which supports recognition elements to create a surface complemen- 
tary to a particular DNA subsequence, and 

3) a sequential chain of globular domains, each domain being more or less rigid and complementary to a portion 
of the surface of a DNA subsequence and the domains being linked by amino acid subsequences that allow the 
domains to wind around the DNA. 

[0131] Complementary charges can accelerate association of molecules, but they usually do not provide much of
the free energy of binding. Major components of binding energy arise from highly complementary surfaces and the
liberation of ordered water on the macromolecular surfaces.

Properties of sequence-specific DNA-binding by polypeptides:

[0132] An extended polypeptide of 24 amino acids lying in the major groove of B-DNA could make sequence-specific
interactions with as many as 15 base pairs, which is about the least recognition that would be useful in eukaryotic
systems. Peptides longer than 24 amino acids can contact more base pairs and thus provide greater specificity.

[0133] Extended polypeptide segments of proteins bind to DNA in natural systems (e.g., λ repressor and Cro, P22
Arc and Mnt repressors). The DNA major groove can accommodate polypeptides in either helical or extended
conformation. Side groups of polypeptides that lie in the major groove can make sequence-specific or
sequence-independent contacts. Since the polypeptide can lie entirely within the major groove, contacts with the
phosphates are allowed but not mandatory. Thus a polypeptide need not be highly positively charged. A neutral or
slightly positively charged polypeptide might have very low non-specific binding.

[0134] Polypeptides composed of the 20 standard amino acids are not flat enough to lie in the minor groove unless
the sequence contains an extraordinary number of glycines; however, residue side-groups could extend into the minor
groove to make sequence-specific contacts. Polypeptides of more than 50 amino acids may fold into stable 3D
structures. Unless part of the surface of the structure is complementary to the surface of the target DNA subsequence,
formation of the 3D structure competes with DNA binding. Thus polypeptides generated for selection of specific binding
are preferably 25 to 50 amino acids in length.

[0135] Polypeptides present the following potential advantages:

a) low molecular weight: an extended polypeptide offers the maximum recognition per amino acid,

b) polypeptides have no inherent dyad symmetry and so are not biased toward recognition of palindromic 
sequences, 

c) polypeptides may have greater specificity than globular proteins, and 

d) peptides may be good models from which other low molecular weight compounds may be designed. 

[0136] Thus, one would choose a polypeptide as initial DNA-binding molecule if high specificity and low molecular 
weight are desired. 

[0137] No sequence-specific DNA-binding by small polypeptides has been reported to date. Possible reasons that 
such polypeptides have not been found include: a) no one has sought them, b) cells degrade polypeptides that are free 
in the cytoplasm, and c) they are too flexible and are not specific enough. 

[0138] In a preferred embodiment, a DNA-binding polypeptide is associated with a custodial domain to protect it 
from degradation, as discussed more fully in Examples 3 and 4. 

Properties of globular proteins influencing choice of initial DBP:

[0139] The majority of the well-characterized DBPs are small globular proteins containing one or more DNA-binding
domains. No single-domain globular protein comprising 200 or fewer amino acids is likely to fold into a stable
structure that follows either groove of DNA continuously for 10 bases. The structure of a small globular protein can be
arranged to hold more than one set of recognition elements in appropriate positions to contact several sites along the
DNA, thereby achieving high specificity; however, the bases contacted are not necessarily sequential on the DNA. For
example, each monomer of λ repressor contains two sequence-specific DNA recognition regions: the recognition helix
of the H-T-H region contacts the front face of the DNA binding site and the N-terminal arm contacts the back face. To
obtain tight binding, a globular protein must contact not only the base-pair edges, but also the DNA backbone, making
sequence-independent contacts. These sequence-independent contacts give rise to a certain sequence-independent
affinity of the protein for DNA. The bases that intervene between segments that are directly contacted influence the
position and flexibility of the contacted bases. If the DNA-protein complex involves twisting or bending the DNA (e.g.,
434 repressor-DNA complex), non-contacted bases can influence binding through their effects on the rigidity of the
target DNA sequence.

[0140] The phage repressors Arc, Mnt, λ repressor and Cro are proposed to bind to DNA at least partly via binding
of extended segments of polypeptide chain. The N-terminal arm of λ repressor makes sequence-specific contacts with
bases in the major groove on the back side of the binding site. The C-terminal "tail" of λ Cro is proposed to make
sequence-independent contacts in the minor groove of the DNA. The structure of neither Arc nor Mnt has been
determined; however, the sequence specificity of the N-terminal arm of Arc can be transferred to Mnt: viz., when Arc
residues 1-9 are fused to Mnt residues 7 through the C-terminal, the fusion protein recognizes the arc operator but not
the mnt operator. Residues 2, 3, 4, 5, 8, and 10 of Arc have been proposed to contact operator DNA and residue 6 of
Mnt has been shown to be involved in sequence-specific operator contacts.

[0141] Binding to non-palindromic sequences requires alteration of dyad-symmetric proteins. Even non-palindromic
DNA has approximate dyad symmetry in the deoxyribophosphate backbone; proteins that are heterodimers or
pseudo-dimers engineered from known globular DBPs are good candidates for the mutation process described here to
obtain globular proteins that bind non-palindromic DNA. It has been observed that the DNA restriction enzymes having
palindromic recognition are composed of dyad-symmetric multimers (MCCL86), while restriction enzymes and other
DNA-modifying enzymes (e.g., Xis of phage λ) having asymmetric recognition are comprised of a single polypeptide
chain or an asymmetric aggregate (RICH88). Such proteins may also provide reasonable starting points to generate
DBPs recognizing non-palindromic sequences.

[0142] A globular protein can bind sequence-specifically to DNA through one set of residues and activate
transcription from an adjacent gene through a different set of residues (for example, λ or P22 repressors). The internal
structure of the protein establishes the appropriate geometric relationship between these two sets of residues. Globular
proteins may also bind particular small molecules, effectors, in such a way that the affinity of the protein for its specific
DNA recognition subsequence is a function of the concentration of the particular small molecules (e.g., CRP and
[cAMP]). Conditional DNA-binding and gene activation are most easily obtained by engineering changes into known
globular DBPs.

[0143] Some DBPs from bacteria and bacteriophage have been shown to have sufficient specificity to operate in 
mammalian cells. 

[0144] An initial DBP may be chosen from natural globular DBPs of any cell type. The natural DBP is preferably
small so that genetic engineering is facile. Preferably, the 3D structure of the natural DBP is known; this can be
determined from X-ray diffraction, NMR, genetic and biochemical studies. Preferably, the residues in the natural DBP
that contact DNA are known. Preferably the residues that are involved in multimer contacts are known. Preferably the
natural operator of the natural DBP is known. More preferably, mutants of the natural operator are known and the
effects of these mutants on binding by natural DBP and mutant DBPs are known. Preferably, mutations of the DBP are
known and the effects on protein folding, multimer formation, and in vivo half-life are known. Most of the above data are
available for λ Cro, λ repressor and fragments of λ repressor, 434 repressor and Cro proteins, E. coli CRP and trp
repressor, P22 Arc, and P22 Mnt.

[0145] Globular DBPs are the best understood DBPs. In many cases, globular DBPs are capable of sufficient spe- 
cificity and affinity for the target DNA sequence. Thus globular DBPs are the most preferred candidates for initial DBP. 
Table 8 contains a list of some preferred globular DBPs for use as initial DBPs. 

[0146] λ repressor and phage 434 repressor have been extensively studied (CHAD71, PTAS80, PABO79, JOHN79,
SAUE79, SAUE86, PABO82a,b, LEWI83, OHLE83, WEIS87a,b,c, REID88, ANDE87, NELS86, ELIA85). Both proteins
comprise an amino-terminal DNA-binding domain having four homologous alpha helices. Helices 2 and 3 form the
H-T-H motif. DNA contacts originate in helix 2, helix 3, and adjacent regions, with helix 3 providing most of the contacts.
The N-terminal domains of λ repressor contact each other along helix 5 (PABO82b) while in 434 repressor the
interdomain contacts are beyond helix 4, there being no helix 5 (ANDE87).

[0147] The operator DNA bends symmetrically in the 434 repressor-consensus operator co-crystal (ANDE87). The
center of the 14 base pair DNA helix is over-wound and bends slightly along its axis such that it curls around the alpha
3 helix of each repressor monomer; the ends of the operator DNA helix are underwound. Bending of operator DNA has
also been proposed in models of Cro protein and CAP protein operator binding (OHLE83, GART88). Consistent with
the results of Gartenberg and Crothers, bending of the 434 operator toward Cro is toward the minor groove and occurs
most readily when the central bases consist exclusively of A and T (KOUD87); in this case, substitution of CG base
pairs greatly reduces binding.

[0148] λ Cro (TAKE77) has been described from an X-ray structure of the protein without DNA (ANDE81). Alpha
helix 2 lies across the operator major groove and may make contacts to operator backbone phosphates at its N-terminal
and C-terminal ends. In addition, backbone phosphates may be contacted by residues at the C terminus of alpha 3, N
terminus of beta 2, and C terminus of beta 3 (PABO84). In computer model building of λ Cro-operator DNA interactions,
bending of operator DNA or bending at the monomer-monomer interface of the Cro dimer have been proposed to make
the best fit between operator and dimer (PABO84).

[0149] Key amino acids within the H-T-H region of 434 Cro and λ Cro are highly conserved (PABO84), and 434 Cro
binds operator DNA as a dimer (WHAR85a). Because the crystals of 434 Cro and DNA do not diffract to high resolution,
atomic details of the protein-DNA interactions are not revealed (WOLB88). Nevertheless, Wolberger et al. report very
significant similarities and differences between the DNA binding patterns of 434 repressor and 434 Cro. These
observations on DBPs from 434, together with recent results on Trp repressor (OTWI88), support the view that a)
structural elements that fit into the major groove of DNA can function in a variety of closely related ways, b) bending of
DNA complexed to proteins is an important determinant of specificity, and c) mechanisms of recognition may be quite
subtle.

[0150] Crystal structures have been determined for two DBPs, CRP (WEBE87a) and TrpR (OTWI88), from E. coli.
Both these proteins contain H-T-H motifs and bind their cognate operators only when particular effector molecules are
bound to the protein: cAMP for CRP and L-tryptophan for TrpR. Binding of each effector molecule causes a
conformational change in the protein that brings the DNA-recognizing elements into correct orientation for strong,
sequence-specific binding to DNA (JOHN86). The DNA-binding function of Lac repressor is also modulated through
protein binding of an effector molecule (e.g., lactose); unlike CRP and TrpR, Lac repressor binds DNA only in the
absence of the effector. CRP can act either as an activator (RENY88) or as a repressor (POLA88) depending on the
relationship between the CRP-binding site and the rest of the promoter.

[0151] Two structures of CRP (MCKA81, MCKA82) and one structure of a CRP mutant (WEBE87a) are available.
Otwinowski et al. (OTWI88) have published an X-ray crystal structure of TrpR bound to the Trp operator. This structure
shows that, although TrpR contains a canonical H-T-H motif, the positioning of the recognition helix with respect to the
DNA is quite different from the positioning of the corresponding helix in other H-T-H DBPs (MATT88) for which
structures of protein-DNA complexes are available. Unlike previously determined structures, most of the interactions
between atoms of TrpR and bases are mediated by localized water molecules. It is not possible to distinguish between
localized water and atomic ions, such as Na+, by X-ray diffraction alone. We shall follow Otwinowski et al. and refer to
these peaks in electron density as water, although ions cannot be ruled out.

[0152] Bass et al. (BASS88) studied the binding of wild type TrpR and single amino acid missense mutants of TrpR
to a consensus palindromic Trp operator and to palindromic operators that differ from the consensus by a symmetric
substitution at one base in each half operator. Bass et al. conclude that the contact between the H-T-H motif of TrpR
and the operators must be substantially different from the model that had been built based on the 434 Cro-DNA
structure.

[0153] Thus the binding of globular DBPs that are modulated by effector molecules is fundamentally the same as
the binding of unmodulated globular DBPs, but the details of each protein's interactions with DNA are quite different.
Prediction of which amino acids will produce strong specific binding is beyond the capabilities of current theory. Given
the important role of localized waters or ions in the TrpR-DNA interface (OTWI88) and in the 434R-DNA interface
(AGGA88), such predictions are likely to remain beyond reach for some time.

[0154] The Mnt repressor of P22 is an 82 residue protein that binds as a tetramer to an approximately palindromic
17 base pair operator, presumably in a manner that is two-fold rotationally symmetric. Although the Mnt protein is 40%
alpha helical and has some homology to λ Cro protein, Mnt is known to contact operator DNA by N-terminal residues
(VERS87a) and possibly by a residue (K79) close to the C terminus (KNIG88). It is unlikely, therefore, that an H-T-H
structure in Mnt mediates DNA binding (VERS87a). Another residue (Y78) close to the C-terminal end has been found
to stabilize tetramer formation (KNIG88). Though the three dimensional structure of Mnt is not known, DNA-binding
experiments have indicated that the Mnt operator, in B-form conformation, is contacted at major groove nucleotides on
both front and back sides of the operator helix (VERS87a).

[0155] The Arc repressor of P22 is a 53 residue protein that binds as a dimer to a partially palindromic 21 base pair
operator adjacent to the mnt operator in P22 and protects a region of the operator that is only partially symmetric
relative to the symmetric sequences in the operator (VERS87b). Arc is 40% homologous to the N-terminal portion of Mnt,
and the N-terminal residues of the Arc protein contact operator DNA such that an H-T-H binding motif is unlikely, as in
Mnt binding (VERS86b). The three dimensional structure of Arc, like Mnt, is not known, but a crystallographic study is
in progress (JORD85). DNA-binding experiments have shown that Arc probably binds along one face of B-form operator
DNA. These experiments indicate that Arc contacts operator phosphates farther out from the center of operator
symmetry than do the repressors or Cro proteins of λ or 434, or P22 Mnt protein. Thus the researchers state that the
operator DNA may be bent around Arc in binding or Arc dimer may have an extended structure to allow such contacts
to occur (VERS87b). These alternatives are not mutually exclusive.

DNA-Binding Proteins Other Than Repressor Proteins

[0156] Any protein (or polypeptide) which binds DNA may be used as an initial DNA-binding protein; the present
method is not limited to repressor proteins, but rather includes other regulatory proteins as well as DNA-binding
enzymes such as polymerases and nucleases.

[0157] Derivatives of restriction enzymes may be used as initial DBPs. All known restriction enzymes recognize
eight or fewer base pairs and cut genomic DNA at many places. Expression of a functional restriction enzyme at high
levels is lethal unless the corresponding sequence-specific DNA-modifying enzyme is also expressed. EcoRI that lacks
residues 1-29, denoted EcoRI-delN29, has no nuclease activity (JENJ86); EcoRI-delN29 binds sequence-specifically
to DNA that includes the EcoRI recognition sequence, GAATTC (BECK88).

[0158] From the structure of R.EcoRI (MCCL86), we can see that extension of the polypeptide chain at either the
amino or carboxy terminus would allow contacts with base pairs outside of the canonical hexanucleotide.
[0159] Specifically, extending EcoRI(AT139), EcoRI(GS140), or EcoRI(RQ203) (YANO87) by, for example, ten
highly variegated residues at the amino terminus and selecting for binding to a target such as TGAATTCA or
GGAATTCC allows isolation of a protein having novel DNA-recognition properties. Alternatively, EcoRI may be extended
at the amino terminus by addition of a zinc-finger domain. It may be useful to have two or more tandem repeats of the
octanucleotide target placed in or near the promoter region of the selectable gene. Fox (FOXK88) has used DNase I to
footprint EcoRI bound to DNA and reports that 15 bp are protected. Thus, repeated octanucleotide targets for proteins
derived from EcoRI should be separated by eight or more base pairs; one could place one copy of the target upstream
of the -35 region and one copy downstream of the -10 region. There are many residues in EcoRI that contact the DNA
as the enzyme wraps around it. These residues could be varied to alter the binding of the protein. To obtain acceptable
specificity, we may need to pick as initial DBP a mutant of EcoRI that folds and dimerizes, but that binds DNA weakly.
The mutations in regions of the protein that contact DNA outside of the original GAATTC will confer the desired affinity
and specificity on the novel protein.

[0160] One may wish to obtain a protein that binds to one target DNA sequence, but not to other sequences that
contain a subsequence of the target. For example, we may seek a protein that recognizes TGAATTCA, but not any of
the sequences vGAATTCb. To achieve this distinction, we place the target sequence in the promoter region of the
selectable gene and one or more instances of the related sequences, to which we intend that the protein not bind, in
the promoter region of an essential gene, such as an antibiotic-resistance gene.
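
The set of related sequences written as vGAATTCb can be enumerated explicitly. The following is a minimal Python sketch, assuming that the lower-case letters are IUPAC ambiguity codes (v = any base except T, b = any base except A); that reading, and the helper name expand, are assumptions made for illustration and are not stated in the text.

```python
from itertools import product

# IUPAC nucleotide ambiguity codes. Reading the lower-case v and b in
# "vGAATTCb" as IUPAC codes is an assumption, not stated in the text.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "V": "ACG",  # any base except T
    "B": "CGT",  # any base except A
}

def expand(pattern):
    """Return every concrete DNA sequence matched by an IUPAC pattern."""
    choices = [IUPAC[ch.upper()] for ch in pattern]
    return ["".join(bases) for bases in product(*choices)]

target = "TGAATTCA"            # sequence the protein should bind
related = expand("vGAATTCb")   # sequences the protein should NOT bind

for seq in related:
    print(seq)
```

Under this reading there are nine related octanucleotides, each containing GAATTC but differing from the target at the first position, the last position, or both.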

[0161] Other stable proteins may also be used as initial DBPs, even if they show no DNA-binding properties.
Parraga et al. (Reference 8 in PARR88) report that Eisen et al. have fused 229 residues of yeast ADR1 to
beta-galactosidase and that the fusion protein binds sequence-specifically to DNA in vitro.

[0162] Adenovirus E1A protein turns on early viral genes as well as the human heat shock protein hsp70 (SIMO88).
Further, a normal inducible nuclear DNA-binding protein regulates the interleukin-2 receptor alpha (IL-2Ralpha) gene
and also promotes activation of transcription from the HIV-1 virus LTR (BOHN88). These studies indicate one of the
many difficulties of designing antiviral chemotherapy by using the transcriptional regulatory apparatus of the virus as a
target. This invention uses unique target sequences, not represented elsewhere in the host genome, as targets for
suppression of gene expression.

[0163] The DNA sequences of operators that interact with proteins that control mating-type and cell-type specific
transcription in yeast (MILL85) reveal that the consensus site for action of the alpha2 protein dimer is symmetric, while
a heterodimeric complex of alpha2 and a1 subunits acts on an asymmetric site. The alpha2a1-responsive site consists
of a half-site that is identical to the alpha2 half-site, and another half-site that is a consensus for a1 protein binding. The
spacings between the symmetric and asymmetric sites are not the same.

[0164] Antibodies that bind DNA and other nucleic acids have been obtained from human patients suffering from
Systemic Lupus Erythematosus. Murine monoclonal antibodies have been obtained that specifically recognize Z-DNA,
B-DNA, ssDNA, triplex DNA, and certain repeating sequences (ANDE88). Anderson et al. (ANDE88) report that: 1) the
antibodies studied contact six base pairs and four phosphates, 2) antibodies are unlikely to provide some of the
well-known motifs for DNA binding, such as helix-turn-helix, 3) study of DNA-antibody complexes may yield insights into
mechanisms of recognition, and 4) a DNA-recognizing antibody might be converted into a sequence or structure specific
nuclease. The shortness of the contact makes it unlikely that high specificity can be attained.

Properties of serially-linked globular domains: 

[0165] A protein motif for DNA binding, present in some eukaryotic transcription factors, is the zinc finger in which
zinc coordinately binds cysteine and histidine residues to form a conserved structure that is able to bind DNA
(FRAN88). Xenopus laevis transcription factor TFIIIA is the first protein demonstrated to use this motif for DNA binding,
but other proteins such as human transcription factor SP1, yeast transcription activation factor GAL4, and estrogen
receptor protein have been shown to require zinc for DNA binding in vitro (EVAN88). Other mammalian and avian
steroid hormone receptors and the adenovirus E1A protein, which bind DNA at specific sites, contain cysteine-rich regions
which may form metal chelating loops.

[0166] Zinc-finger regions have been observed in the sequences of a number of eukaryotic DBPs, but no
high-resolution 3D structure of a Zn-finger protein is yet available. A variety of models have been proposed for the binding
of zinc-finger proteins to DNA (FAIR86, PARR88, BERG88, GIBS88). Model building suggests which residues in the
Zn-fingers contact the DNA and these would provide the primary set of residues for variation. Berg (BERG88) and Gibson
et al. (GIBS88) have presented models having many similarities but also some significant differences. Both models
suggest that the motif comprises an antiparallel beta structure followed by an alpha helix and that the front side of the helix
contacts the major groove of the DNA. By assuming that conserved basic residues of the Zn-finger make contact with
phosphate groups in each copy of the motif, Gibson et al. deduce that the amino terminal part of the helix makes direct
contact to the DNA. The Gibson model does not, however, account well for the number of bases contacted by Zn-finger
proteins. The observations on H-T-H proteins suggest that a DNA-recognizing element can interact in a variety of ways
with DNA and we assert that a similar situation is likely in Zn-finger proteins. Thus, until a 3D model of a Zn-finger
protein bound to DNA is available, all of the residues modeled as occurring on the alpha helix away from the beta structure
should be considered as primary candidates for variegation when one wishes to alter the DNA-binding properties of a
Zn-finger protein. In addition, residues in the beta segment may control interactions with the sugar-phosphate backbone,
which can affect both specific and non-specific binding.

[0167] Parraga et al. (PARR88) have reported a low-resolution structure of a single zinc-finger from NMR data.
They confirm the alpha helix proposed by Berg and by Gibson et al., but not the antiparallel beta sheet. The models
proposed by Klug and colleagues (FAIR86) have a common feature that is at variance with the models of Berg and of
Gibson et al., viz. that the protein chain exits each finger domain at the same end that it entered. The structure
published by Parraga et al. does not settle this point, but suggests that the exit strand tends toward the end opposite from
the entrance strand, thereby supporting the overall models of Berg and of Gibson et al. Parraga et al. also report that
a) a chimeric molecule consisting of zinc-finger domains linked to beta-galactosidase binds sequence-specifically to
DNA and b) a protein comprising only two finger motifs can bind sequence-specifically to DNA. They do not suggest
that the residues could be mutagenized to achieve novel recognition.

[0168] A protein composed of a series of zinc fingers offers the greatest potential of uniquely recognizing a single
site in a large genome. A series of zinc fingers is not so well suited to development of a DBP that is sensitive to an
effector molecule as is a more compact globular protein such as E. coli CRP. Positive control of genes adjacent to the target
DNA subsequence can be achieved as in the case of TFIIIA.

Overview: Variegation Strategy

Choice of residues in parental potential-DBP to vary:

[0169] We choose residues in the initial potential-DBP to vary through consideration of several factors, including: a)
the 3D structure of the initial DBP, b) sequences homologous to the initial DBP, c) modeling of the initial DBP and
mutants of the initial DBP, d) models of the 3D structure of the target DNA, and e) models of the complex of the initial
DBP with DNA. Residues may be varied for several reasons, including: a) to establish novel recognition by changing the
residues involved directly in DNA contacts while keeping the protein structure approximately constant, b) to adjust the
positions of the residues that contact DNA by altering the protein structure while keeping the DNA-contacting residues
constant, c) to produce heterodimeric DBPs by altering residues in the dimerization interface while keeping
DNA-contacting residues constant, and d) to produce pseudo-dimeric DBPs (see below) by varying the residues that join
segments of dimeric DBPs while keeping the DNA-contacting residues and other residues fixed.
[0170] If a dimeric protein comprises two identical polypeptide chains related by a two-fold axis of rotation, we
speak of a homodimer with two-fold dyad symmetry. When two very similar polypeptides fold into similar domains and
associate, we may observe that there is an approximate two-fold rotational axis that relates homologous residues, such
as in the alpha1-beta1 dimer of haemoglobin. We refer to such a protein as a heterodimer and to the symmetry axis as a
quasi-dyad. When we produce a single-chain DBP by fusing gene fragments that encode two DNA-binding domains
joined by a linker amino acid subsequence, we call the molecule a pseudo-dimer and the axis that relates pairs of
residues a pseudo-dyad.

Principles that guide choice of residues to vary:

[0171] A key concept is that only structured proteins exhibit specific binding, i.e., can bind to a particular chemical
entity to the exclusion of most others. In the case of polypeptides, the structure may require stabilization in a complex
with DNA. The residues to be varied are chosen to preserve the underlying initial DBP structure or to enhance the
likelihood of favorable polypeptide-DNA interactions. The selection process eliminates cells carrying genes with mutations
that prevent the DBP from folding. Genes that code for proteins or polypeptides that bind indiscriminately are eliminated
since cells carrying such proteins are not viable. Although preservation of the basic underlying initial DBP structure is
intended, small changes in the geometry of the structure can be tolerated. For example, the spatial relationship
between the alpha 3 helix in one monomer of λ Cro and the alpha 3 helix in the dyad-related monomer (denoted alpha
3') is a candidate for variation. Small changes in the dimerization interface can lead to changes of up to several Å in the
relative positions of residues in alpha 3 and alpha 3'.

[0172] Burial of hydrophobic surfaces so that bulk water is excluded is one of the strongest forces driving the folding
of macromolecules and the binding of proteins to other molecules. Bulk water can be excluded from the region between
two molecules or between two portions of a single molecule only if the surfaces are complementary. The double helix
of B-DNA allows most of the hydrophobic surface of the nucleotides to be buried. The edges of the bases have several
hydrogen-bonding groups; the methyl group of thymine is an important hydrophobic group in DNA (HARR88). To achieve
tight binding, the shape of the protein must be highly complementary to the DNA, all or almost all hydrogen-bonding groups
on both the DNA and the protein must make hydrogen bonds, and charged groups must contact either groups of
opposite charge or groups of suitable polarity or polarizability.

[0173] There are two complementary interfaces of major interest: a) the DNA-protein interface and b) the interface
between protein monomers of dimers or between domains of pseudo-dimers. The DNA-protein interface is more polar
than most protein-protein interfaces, but hydrophobic amino acids (e.g., F, L, M, V, I, W, Y) occur in sequence-specific
DNA-protein interfaces. The protein-protein interfaces of natural DBPs are typical protein-protein interfaces.
[0174] Amino acids are classified as hydrophilic or hydrophobic (ROSE85, EISE86a,b), and although this
classification is helpful in analyzing primary protein structures, it ignores that the side groups may contain both hydrophobic
and hydrophilic portions, e.g., lysine. Hydrogen bonds and other ionic interactions have strong directional behavior,
while hydrophobic interactions are not directional. Thus substitution of one hydrophobic side group for another
hydrophobic side group of similar size in an interface is frequently tolerated and causes subtle changes in the interface. For
the purposes of the present invention, such hydrophobic-interchange substitutions are made in the protein-protein
interface of DBPs so that a) the geometry of the two monomers in the dimer will change, and b) compensating interactions
produce exclusively heterodimers.

[0175] The process claimed here tests as many surfaces as possible to select one as efficiently as possible that 
binds to the target. The selection isolates cells producing those proteins that are more nearly complementary to the tar- 
get DNA, or proteins in which intermolecular or intramolecular interfaces are more nearly complementary to each other 
so that the protein can fold into a structure that can bind DNA. The effective diversity of a variegated population is meas- 
ured by the number of different surfaces, rather than the number of protein sequences. Thus we should maximize the 
number of surfaces generated in our population, rather than the number of protein sequences. Proteins do not have dis- 
tinct, countable surfaces; therefore, we define an interaction set as a collection of residues of a protein that can simul- 
taneously touch the target DNA. 

[0176] If N spatially separated residues of a protein are varied, 20 x N surfaces are generated. Variation of N
residues in the same interaction set yields 20^N surfaces. For example, if N = 6, variation of spatially separated residues
yields 120 surfaces while variation of interacting residues yields 20^6 = 6.4 x 10^7 surfaces. The process of varying
residues in an interaction set to maximize the number of surfaces obtained is referred to as Structure-directed
Mutagenesis.
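
The two counts in the preceding paragraph can be checked with a short calculation. This is a minimal Python sketch, assuming all 20 amino acids are allowed at each varied residue; the function names are illustrative only.

```python
AMINO_ACIDS = 20  # number of amino acid types allowed at each varied residue

def surfaces_separated(n):
    """Surfaces from varying n spatially separated residues (counts add)."""
    return AMINO_ACIDS * n

def surfaces_interacting(n):
    """Surfaces from varying n residues of one interaction set (counts multiply)."""
    return AMINO_ACIDS ** n

n = 6
print(surfaces_separated(n))    # additive count for separated residues
print(surfaces_interacting(n))  # multiplicative count for an interaction set
```

For N = 6 the first function gives 120 and the second gives 64,000,000, i.e. 6.4 x 10^7, which is why varying residues within a single interaction set is far more productive.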

[0177] If the protein residues to be varied are close enough together in sequence that the variegated DNA (vgDNA)
encoding all of them can be made in one piece, then cassette mutagenesis is picked. The present invention is not
limited to a particular length of vgDNA that can be synthesized. With current technology, a stretch of 60 amino acids (180
DNA bases) can be spanned.

[0178] Mutation of residues further than sixty residues apart can be achieved using other methods, such as
single-stranded-oligonucleotide-directed mutagenesis (BOTS85) and two or more mutating primers.
[0179] To vary residues separated by more than sixty residues, two cassettes may be mutated serially. From 2-fold
to 1000-fold variegation is first introduced into a first cassette. We then introduce 1000-fold to 10^-fold variegation into
a second cassette of the variegated vector population. The composite level of variation preferably does not exceed the
prevailing capabilities to a) produce very large numbers of independently transformed cells or b) select small
components in a highly varied population. The limits on the level of variegation are discussed below.
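
The choice between a single cassette and the other methods described above depends only on how far apart the varied residues lie in the sequence. The following Python sketch illustrates that decision, assuming the 60-residue span quoted above as the limit; the function name, the returned fields, and the example residue positions are hypothetical.

```python
MAX_CASSETTE_RESIDUES = 60  # span quoted in the text for current technology

def plan_mutagenesis(positions):
    """Pick a mutagenesis route from the sequence positions of the residues to vary."""
    span = max(positions) - min(positions) + 1  # residues covered, inclusive
    if span <= MAX_CASSETTE_RESIDUES:
        # One piece of vgDNA can encode all varied residues (3 bases per residue).
        return {"method": "cassette", "span": span, "vgDNA_bases": 3 * span}
    # Otherwise use oligonucleotide-directed mutagenesis with several
    # primers, or mutate two cassettes serially.
    return {"method": "oligonucleotide-directed or serial cassettes",
            "span": span, "vgDNA_bases": None}

print(plan_mutagenesis([27, 28, 32, 33]))  # hypothetical, closely spaced residues
print(plan_mutagenesis([4, 83]))           # hypothetical, widely spaced residues
```

The sketch does not model the composite level of variegation for serial cassettes, which is limited separately by the number of independently transformed cells.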

Assembly of Relevant Data:

[0180] Here we assemble the data about the initial DBP and the target that are useful in deciding which residues to
vary in the variegation cycle:

1) 3D structure, or at least a list of residues that contact DNA and that are involved in the dimer contact of the initial
DBP,

2) list of sequences homologous to the initial DBP, and

3) model of the target DNA sequence.

These data and an understanding of the function and structure of different amino acids in proteins will be used to
answer three questions:

1) which residues of the initial DBP are on the outside and close enough together in space to touch the target DNA
simultaneously?

2) which residues of the initial DBP can be varied with high probability of retaining the underlying initial DBP
structure?

3) which residues of the initial DBP can affect the dimerization or folding of the initial DBP?



[0181] Although an atomic model of the target material is preferred in such examination, it is not necessary. 
Graphical and computational tools: 

[0182] The most appropriate method of picking the residues of the protein chain at which the amino acids should
be varied is by viewing with interactive computer graphics a model of the initial DBP complexed with operator DNA. A
model based on X-ray data from the DNA-protein complex is preferred, but other models may be used. A stick-figure
representation of molecules is preferred. Suitable programs for viewing and manipulating protein and nucleic acid
models include: a) PS-FRODO, written by T. A. Jones (JONE85) and distributed by the Biochemistry Department of Rice
University, Houston, TX; and b) PROTEUS, developed by Dayringer, Tramontano, and Fletterick (DAYR86). Any
hardware that supports either of these programs is appropriate.

Use of Knowledge of Mutations Affecting Protein Stability

[0183] In choosing the residues to vary and the substitutions to be made for such residues, one may make use not
only of modelling as described above but also of experimental data concerning the effects of mutation in the initial
DNA-binding protein. Mutations which will markedly reduce protein stability are to be avoided in most cases.
[0184] Missense mutations that decrease DNA-binding protein function non-specifically by affecting protein folding
are distinguished from binding-specific mutations primarily on the basis of protein stability (NELS83, PAKU86,
VERS86b, HECH84, HECH85a, and HECH85b).

[0185] Tables 1, 12, and 13 summarize the results of a number of studies on single missense mutations in the three
bacteriophage repressor proteins: λ repressor (Table 12) (NELS83, GUAR82, HECH85a, and NELS85), λ Cro (Table
1) (PAKU86, EISE85), and P22 Arc repressor (Table 13) (VERS86a, VERS86b). The majority of the mutant sequences
shown in Tables 1, 12, and 13 were obtained in experiments designed to detect loss of function in vivo. The second-site
pseudo-reversion mutations (HECH85a) and suppressed nonsense mutations (NELS83) restore function, and some
of the site-specific changes (EISE85) produce functional proteins.

[0186] Roughly 50-70% of the single missense mutations of the DNA-binding proteins selected for loss of function
(Tables 1, 12, and 13) produce protein folding defects.

Use of Knowledge of Mutations Affecting the DNA-Protein Interface

[0187] Missense mutations in residues thought to be involved in specific interactions with DNA have been reported
for several prokaryotic repressor proteins. Table 14 shows an alignment of the H-T-H DNA-binding domains of four
prokaryotic repressor proteins (from top to bottom: λ repressor, λ Cro, 434 repressor, and trp repressor) and indicates
the positions of missense mutations in residues that are solvent-exposed in the free protein but become buried in the
protein-DNA complex, and that affect DNA binding.

[0188] Randomly obtained missense mutations in solvent-exposed residues of λ repressor, λ Cro, and trp
repressor yield sets of mutants that reduce DNA binding (Table 14). These sets correlate well to the sets of residues that are
proposed to interact directly with DNA. Some mutations in λ Cro (EISE85) and all those shown for 434 repressor
(WHAR85a) were obtained through site-directed mutagenesis. Most of the mutations shown in the λ and trp repressor
sequences are trans-dominant when the mutant gene is present on an overproducing plasmid (NELS83, KELL85). The
exceptions to trans-dominance are the λ repressor SP35 and the trp repressor AT80 mutations. This latter change
produces a repressor that has only slightly reduced binding (KELL85). The trans-dominance observed for these mutations
is proposed by the authors to result from the wild-type repressor and the mutant repressor forming mixed oligomers
which are inactive in binding to operator sites.

[0189] Wharton (WHAR85a) has reported that extensive site-directed mutagenesis of 434 repressor positions 28
and 29 produced no functional protein sequences other than the wild-type. Apparently, in the context of 434 repressor
structure and operators, only proteins with the wild-type Q28-Q29 sequence bind to the wild-type operators.
[0190] Table 14 also shows missense mutations that result in near normal repressor activity. Substitution of 434
repressor Q33 with H, L, V, T, or A produces repressors that function if expressed from overproducing plasmids
(WHAR85a); repressor specificity is, however, reduced. Mutations in λ repressor, QY33 (NELS83, HECH83), and in λ
Cro, YF26 (EISE85), produce altered proteins which make one less H-bond to the DNA and which bind to the operator
DNA with reduced affinity. Thus, loss of a single H-bond is insufficient to completely abolish binding of DNA. Mutations
YK26 and HR35 in λ Cro show nearly normal binding (EISE85).

[0191] Nelson and Sauer (NELS85) and Hecht et al. (HECH85a, b) have described four replacements in λ
repressor (Table 12): EK34, GN48, GS48, and EK83. These derivatives have higher affinity for OR1 than wild-type λ repressor.
[0192] Extended amino acid arms at N- and C-terminal locations are important DNA-binding structures in at least
four prokaryotic repressors: λ repressor and Cro, and P22 Arc and Mnt.
[0193] Sequence-specific and sequence-independent contacts are made by the first 6 amino acid residues
(STKKKP) of the λ repressor N-terminal region, which form an "arm" that can wrap around the DNA (ELIA85,
PABO82a). Missense mutations KE4 and LP12 (Table 12) both greatly reduce repressor activity in vivo (NELS83).
Deletion of the first six residues results in a protein which is non-functional in vivo (ELIA85). Deletion of the first three
residues results in decrease of affinity for OR1, loss of protection of back side guanines, altered specificity between OR1
and OR3, and decreased binding sensitivity to changes in temperature or salt concentration (ELIA85, PABO82a).
[0194] Missense mutations of P22 Arc that produce non-functional proteins with high intracellular specific protein
levels (Table 13) are found only in the N-terminal 10 residues of the protein (VERS86b). A single residue change at
position 6 (HP6) in P22 Mnt changes operator recognition in the altered protein (YOUD83, VERS86a,b). Knight and
Sauer (cited in VERS86a,b) replaced the first 6 residues of Mnt repressor with the first 9 residues of Arc repressor to
produce a repressor that binds to the arc operator but not to the mnt operator. Thus P22 Mnt and Arc use a recognition
region located in the first 6-10 amino-terminal residues for DNA recognition and binding. The N-terminal DNA-binding
region of these proteins cannot be the recognition helix of a typical H-T-H motif.

[0195] In λ Cro, a C-terminal sequence (K62-K63-T64-T65-A66) has been suggested on the basis of model
building (TAKE85) and NMR measurements (LEIG87) to form a flexible arm that interacts with minor groove phosphates.
Eisenbeis and Caruthers (cited in KNIG88) have found that T64, T65, and A66 have minor effects on protein-operator
affinity, while K63 is very important. The C-terminal sequence of P22 Mnt (K79-K80-T81-T82) is almost identical to that
of λ Cro. It has been shown (KNIG88) that deletion of the three residues after K79 has little effect on protein structure
or DNA binding. Deletion of K79 and the distal residues, however, reduces operator binding by three orders of
magnitude with little apparent change in protein structure.

Use of Knowledge of Mutations Affecting the Protein-Protein Interface

[0196] It is also possible to modulate DNA-binding specificity by altering the protein-protein interface. Because the
oligomerization equilibrium is coupled to DNA binding, mutations that alter oligomerization affect operator site affinity.
Since oligomerization involves the matching of protein surfaces, many interactions are hydrophobic and mutations
which specifically destabilize oligomerization are similar to mutations which destabilize global protein structure.
Interactions at the site of oligomerization can influence the strength of interactions at the DNA-binding site by subtle alterations
in protein structure.

Use of Mutations That Affect Activation 

[0197] When λ, 434, and P22 repressors bind to their respective OR2 sites, they activate transcription (POTE80,
POTE82, PTAS80). The site on λ repressor which activates RNA polymerase is located on the N-terminal domain of the
molecule (BUSH88, HOCH83, SAUE79). Activation requires contact between the N-terminal domain of repressor at
OR2 and RNA polymerase (HOCH83, SAUE79) and this contact stimulates isomerization of the polymerase complex
to the open form (McClure and Hawley, cited in GUAR82).

[0198] Missense mutations in λ, P22, or 434 repressors that specifically reduce PRM activation while leaving
operator binding intact are in the solvent-exposed protein surface closest to RNA polymerase bound at PRM (GUAR82,
PABO79, BUSH88, WHAR85a). For λ and 434 repressor this surface includes residues in alpha helix 2 and in the turn
between alpha helices 2 and 3. In P22 repressor, the surface is formed at the carboxyl terminus of alpha helix 3
(PABO79, TAKE83). In each repressor the changes that reduce transcriptional activation at PRM involve the substitution
of a basic residue for a neutral or acidic residue. Further, missense mutations in λ and 434 repressors which increase
transcription at PRM involve the substitution of an acidic residue for a neutral or basic residue (GUAR82, BUSH88).
[0199] Transcriptional activation at PRM involves the apposition of a negatively charged surface on the N-terminal
domain of λ, 434, or P22 repressor to a site on RNA polymerase (BUSH88). Mutations that a) alter the
negatively-charged surface of repressor by removing acidic residues or by replacing them with basic residues, or b) position
the negative surface incorrectly with respect to RNA polymerase, decrease transcriptional activation at PRM. Alterations
that produce a more negatively charged surface act to increase transcription at PRM.
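
The charge rule stated above can be expressed as a small calculation. The following Python sketch assumes the mutation notation used in this document (wild-type residue, substituted residue, position, as in GN48 and GS48), treats D and E as acidic, K and R as basic, and every other residue (including H) as neutral, and considers only the charge effect, not the positioning effect. It applies only to residues already known to lie in the activating surface. The function names and the example mutations are hypothetical.

```python
import re

# Formal side-chain charges; every residue not listed is treated as neutral
# (treating histidine as neutral is a simplifying assumption).
CHARGE = {"D": -1, "E": -1, "K": +1, "R": +1}

def parse_mutation(code):
    """Split a code such as 'GN48' into (wild-type, substitute, position)."""
    m = re.fullmatch(r"([A-Z])([A-Z])(\d+)", code)
    if m is None:
        raise ValueError(f"unrecognized mutation code: {code!r}")
    return m.group(1), m.group(2), int(m.group(3))

def predicted_activation_effect(code):
    """Predict the direction of change in activation from the charge change alone."""
    old, new, _ = parse_mutation(code)
    delta = CHARGE.get(new, 0) - CHARGE.get(old, 0)
    if delta > 0:
        return "decrease"   # surface becomes less negative
    if delta < 0:
        return "increase"   # surface becomes more negative
    return "no charge change"

# Hypothetical substitutions at a hypothetical surface position 5.
for code in ("DN5", "GE5", "LV5"):
    print(code, predicted_activation_effect(code))
```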

Pick principal set of residues to vary:

[0200] A huge number of variant DNA sequences can be generated by synthesis with mixed reagents at chosen
bases. Usually, it is necessary that the number of variants not exceed the number of independently transformed cells
generated from the synthetic DNA. It is efficient, however, to make the number of variants as close as practical to this
limit. The total number of variants is the product of the number of variants at each varied codon over all the variable
codons. Thus, we first consider which residues could be varied with an expectation that alteration could affect DNA
binding. We then pick a range of amino acids at each variable residue. The total number of variants is the product of
these numbers. If the product is too large or too small, we alter the list of residues and range of variation at each variable
residue until an acceptable number is found.
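
The bookkeeping described in paragraph [0200] can be sketched in a few lines of Python. This is an illustrative sketch only: the function names and the transformant limit of 10^8 are assumptions chosen for the example, not values fixed by the text.

```python
from math import prod

def total_variants(choices_per_codon):
    """Product of the number of variants allowed at each varied codon."""
    return prod(choices_per_codon)

def fits_library(choices_per_codon, n_transformants):
    """True if the variant count does not exceed the number of independently
    transformed cells (the limit is supplied by the caller)."""
    return total_variants(choices_per_codon) <= n_transformants

# Example: five codons, each varied through all 20 amino acids.
scheme = [20, 20, 20, 20, 20]
n_variants = total_variants(scheme)                   # 20**5
ok = fits_library(scheme, n_transformants=10**8)      # assumed limit
```

If the product falls outside the acceptable range, the list passed in (which residues, and how many choices at each) is what gets adjusted.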

[0201] Considering which residues are on the surface of the initial DBP, we pick residues that are close enough
together on the surface of the initial DBP to touch a molecule of the target simultaneously without having any initial DBP
main-chain atom come closer than van der Waals distance (viz., 4.0 to 5.0 Å center to center) to any target atom. For
the purposes of the present invention, a residue of the initial DBP "touches" the target if:

a) a main-chain atom is within van der Waals distance, viz., 4.0 to 5.0 Å, of any atom of the target molecule,

b) the Cbeta is within a specific distance of any atom of the target molecule so that a side-group atom could make
contact with that atom, or

c) there is evidence that altering the residue alters the DNA-binding of the initial DBP.
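
The three "touches" criteria can be expressed as a simple geometric test. The sketch below is an assumption-laden illustration: coordinates are plain (x, y, z) tuples in Å, the 5.0 Å cutoff is the upper end of the van der Waals range quoted above, and the Cbeta cutoff of 8.0 Å is a hypothetical placeholder because the text says only "a specific distance".

```python
from math import dist

VDW_MAX = 5.0        # upper end of the 4.0 to 5.0 Å range given in the text
CBETA_CUTOFF = 8.0   # hypothetical value; the text does not specify it

def touches(main_chain, cbeta, target_atoms,
            has_binding_evidence=False, cbeta_cutoff=CBETA_CUTOFF):
    """Return True if a residue 'touches' the target under criteria a), b), or c)."""
    # c) experimental evidence that altering the residue alters DNA binding
    if has_binding_evidence:
        return True
    # a) any main-chain atom within van der Waals distance of any target atom
    if any(dist(a, t) <= VDW_MAX for a in main_chain for t in target_atoms):
        return True
    # b) Cbeta close enough that a side-group atom could reach the target
    if cbeta is not None and any(dist(cbeta, t) <= cbeta_cutoff for t in target_atoms):
        return True
    return False

target = [(0.0, 0.0, 0.0)]
near_by_cbeta = touches([(10.0, 0.0, 0.0)], (7.0, 0.0, 0.0), target)
too_far = touches([(10.0, 0.0, 0.0)], (9.0, 0.0, 0.0), target)
```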

[0202] The residues in the principal set need not be contiguous in the protein sequence. The exposed surfaces of 
the residues to be varied need not be connected. We prefer only that the amino acids in the residues to be varied all be 
capable of touching a single copy of the target DNA sequence simultaneously without atoms overlapping. 
[0203] In addition to the geometrical criteria, we prefer that there be indications that the initial DBP structure will
tolerate substitutions at each residue in the principal set of residues. Indications could come from various sources,
including homologous sequences and modeling.

Pick a secondary set of residues to vary. 

[0204] The secondary set comprises those residues not in the primary set that touch residues in the primary set. 
These residues might be excluded from the primary set because the residue is: a) internal, b) highly conserved, or c)
on the surface, but the curvature of the initial DBP surface prevents the residue from being in contact with the target at 
the same time as one or more residues in the primary set. 

[0205] Internal residues are frequently conserved and the amino acid type cannot be changed to a significantly
different type without risk that the protein structure will be disrupted. Nevertheless, some conservative changes of internal
residues, such as I to L or F to Y, are tolerated. Such conservative changes affect the detailed placement and dynamics
of adjacent protein residues and such variation may be useful to improve the characteristics of DBP binding.
[0206] Surface residues in the secondary set are most often located on the periphery of the principal set. Such
peripheral residues cannot make direct contact with the target simultaneously with all the other residues of the principal
set. It is appropriate to vary the charge of some or all of these residues. For example, the variegated codon containing
equimolar A and G at base 1, equimolar C and A at base 2, and A at base 3 yields amino acids T, A, K, and E with equal
probability.
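
The charge-varying codon of paragraph [0206] can be checked by enumerating the mixture against the standard genetic code. The sketch below is illustrative; the helper name `encoded_amino_acids` and the dictionary representation of a base mixture are assumptions made for the example.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG ('*' = stop).
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    b1 + b2 + b3: AMINO[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}

def encoded_amino_acids(mix1, mix2, mix3):
    """Map each encoded amino acid to its abundance, given the base
    fractions (dict base -> fraction) at codon positions 1, 2, and 3."""
    abundance = {}
    for (b1, f1), (b2, f2), (b3, f3) in product(mix1.items(), mix2.items(), mix3.items()):
        aa = CODON_TABLE[b1 + b2 + b3]
        abundance[aa] = abundance.get(aa, 0.0) + f1 * f2 * f3
    return abundance

# Equimolar A/G at base 1, equimolar C/A at base 2, A at base 3.
charge_codon = encoded_amino_acids({"A": 0.5, "G": 0.5}, {"C": 0.5, "A": 0.5}, {"A": 1.0})
```

The four codons ACA, AAA, GCA, and GAA each receive one quarter of the population, giving a neutral pair (T, A) alongside one basic (K) and one acidic (E) residue.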

Choice of residues to vary simultaneously: 

[0207] The allowed level of variegation determines how many residues can be varied at once; geometry determines 
which ones. The user may pick residues to vary in many ways; the following is a preferred manner. The user picks the 
objective of the variegation, vide supra.

[0208] The number of residues picked is coupled to the range through which each can be varied. In the first round
progressivity is not an issue: the user may elect to produce a level of variegation such that each molecule of vgDNA is
potentially different through, for example, unlimited variegation of 10 codons (20^10 = approx. 10^13 different protein
sequences). The levels of efficiency of ligation and transformation reduce the number of DNA sequences actually
tested to between 10^ and 10^. Multiple performances of the process with very high levels of variegation will not yield
repeatable results; the user decides whether this is important.

Pick range of variation: 

[0209] Each varied residue can have a different scheme of variegation, producing 2 to 20 different possibilities. We
require that the process be progressive, i.e., each variegation cycle produces a better starting point for the next
variegation cycle than the previous cycle produced.

N.B.: Setting the level of variegation such that the parental dbp gene and many sequences related to the parental dbp
sequence are present in detectable amounts ensures that the process is progressive. If the level of variegation is so
high that the parental dbp sequence cannot be detected as a transformant, then each round of
mutagenesis is independent of previous rounds and there is no assurance of progressivity. This approach can lead
to valuable DNA-binding proteins, but multiple repetitions of the process at this level of variegation will not yield
progressive results. Excessive variegation is not preferred in subsequent iterations of this process.

[0210] Progressivity is not an all-or-nothing property. So long as most of the information obtained from previous
variegation cycles is retained and many different surfaces that are related to the parental DBP surface are produced, the
process is progressive. If the level of variegation is so high that the parental dbp gene may not be detected, the
assurance of progressivity diminishes. If the probability of recovering the parental DBP is negligible, then the probability of
progressive results is also negligible.

[0211] An opposing force in our design considerations is that DBPs are useful in the population only up to the
amount that can be detected; any excess above the detectable amount is wasted. Thus we produce as many surfaces
related to the parental DBP as possible within the constraint that the parental DBP be present as a marker for the
detection level.



Mutagenesis of DNA: 

[0212] We now decide how to distribute the variegation within the codons for the residues to be varied. These
decisions are influenced by the nature of the genetic code. When vgDNA is synthesized, variation at the first base of a
codon creates a population coding for amino acids from the same column of the genetic code table (Table 16); variation
at the second base of the codon creates a population coding for amino acids from the same row of the genetic code
table; variation at the third base of the codon creates a population coding for amino acids from the same box. Work with
3D protein structural models may suggest definite sets of amino acids to substitute at a given residue, but the method
of variation may require that either more or fewer kinds of amino acids be included. For example, substitution of N or Q at a
given residue may be wanted. Combinatorial variation of codons requires that mixing N and Q at one location also
include K and H as possibilities at the same residue. The present invention does not rely on accurate predictions of the
amino acids to be placed at each residue; rather, attention is focused on which residues should be varied.
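
The N/Q example can be reproduced by pooling the bases needed at each codon position and enumerating every codon the pooled mixture can form. This is a sketch under the assumption that N is encoded as AAC and Q as CAG; the function names are hypothetical.

```python
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    b1 + b2 + b3: AMINO[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}

def positional_mix(codons):
    """Bases that must be present at each codon position to obtain all given codons."""
    return [sorted({codon[i] for codon in codons}) for i in range(3)]

def encoded_set(mix):
    """All amino acids (or '*') reachable from a per-position base mixture."""
    return {CODON_TABLE[a + b + c] for a in mix[0] for b in mix[1] for c in mix[2]}

mix = positional_mix(["AAC", "CAG"])   # assumed codons for N and Q
reachable = encoded_set(mix)           # N and Q, plus the unavoidable K and H
```

Mixing A/C at base 1 and C/G at base 3 also produces AAG (K) and CAC (H), which is why the two extra residues accompany N and Q.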
[0213] There are many ways to generate diversity in a protein (RICH86, CARU85, OLIP86). An extreme case is that
one or a few residues of the protein are varied as much as possible (inter alia see CARU85, CARU87, RICH86,
WHAR85a). We will call this limit "Focused Mutagenesis". When there is no binding between the parental DBP and the
target, we preferably pick a set of five to seven residues on the surface and vary each through all 20 possibilities.
[0214] An alternative plan of mutagenesis ("Diffuse Mutagenesis") that may be useful is to vary many more
residues through a more limited set of choices (VERS86a,b, INOU86 (Ch. 15), PAKU86). This can be accomplished by
spiking each of the pure nucleotides activated for DNA synthesis (e.g., nucleotide-phosphoramidites) with one or more of
the other activated nucleotides. Contrary to general practice, the present invention sets the level of spiking so that only
a small percentage (1% to .00001%, for example) of the final product will contain the parental DNA sequence. This will
ensure that the majority of molecules carry single, double, triple, and higher mutations and, as required for progressivity,
that recovery of the parental sequence will be a possible outcome.

[0215] Let Nb be the number of bases to be varied, and let Q be the fraction of all DNA sequences that should have
the parental sequence; then M, the fraction of the nucleotide mixture that is the majority component, is

M = exp{ log_e(Q)/Nb } = 10^( log_10(Q)/Nb ).

If, for example, thirty base pairs on the DNA chain were to be varied and 1% of the product is to have the parental
sequence, then each mixed nucleotide substrate should contain 86% of the parental nucleotide and 14% of other
nucleotides. Table 17 shows the fraction (fn) of DNA molecules having n non-parental bases when 30 bases are synthesized
with reagents that contain fraction M of the majority component. When M=.63096. f24 and higher are less than 10' . 
Note that substantial probability for 8 or more substitutions occurs only if the fraction of parental sequence (fO) drops to 
around 10"^. 
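
The spiking level and the resulting distribution of substitutions follow directly from the relation above. The sketch below assumes each varied base is incorporated independently, so the number of non-parental bases is binomially distributed; that independence assumption and the function names are illustrative, and Table 17 itself is not reproduced here.

```python
from math import comb

def majority_fraction(q, n_bases):
    """M such that M**n_bases == q, i.e. M = Q**(1/Nb)."""
    return q ** (1.0 / n_bases)

def fraction_with_n_changes(m, n_bases, n):
    """f_n: fraction of molecules with exactly n non-parental bases,
    assuming independent incorporation at each varied base."""
    return comb(n_bases, n) * m ** (n_bases - n) * (1.0 - m) ** n

m = majority_fraction(0.01, 30)     # about 0.86, as in the worked example
f = [fraction_with_n_changes(m, 30, n) for n in range(31)]
```

Here `f[0]` equals Q, the fraction carrying the parental sequence, and the remaining entries give the single, double, triple, and higher mutants.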

[0216] The Nb base pairs of the DNA chain that are synthesized with mixed reagents need not be contiguous. They
are picked so that between Nb/3 and Nb codons are affected to various degrees. The residues picked for mutation are
picked with reference to the 3D structure of the initial DBP, if known. For example, one might pick all or most of the
residues in the principal and secondary set. We may impose restrictions on the extent of variation at each of these residues
based on homologous sequences or other data. The mixture of non-parental nucleotides need not be random; rather,
mixtures can be biased to give particular amino acid types specific probabilities of appearance at each codon. For
example, one residue may contain a hydrophobic amino acid in all known homologous sequences; in such a case, the
first and third base of that codon would be varied, but the second would be set to T. This Diffuse Mutagenesis will reveal
the subtle changes possible in the protein backbone associated with conservative interior changes, such as V to I, as
well as some not so subtle changes that require concomitant changes at two or more residues of the protein.

Mutagenesis:

[0217] If we have no information indicating that a particular amino acid or class of amino acid is appropriate, we
approximate substitution of all amino acids with equal probability because representation of one or a few genes
above the detectable level is unproductive. Equal amounts of all four nucleotides at each position in a codon yield the
amino acid distribution:

[0218] This distribution has the disadvantage of giving two basic residues for every acidic residue. Such
predominance of basic residues is likely to promote sequence-independent DNA binding. In addition, six times as much R, S,
and L as W or M occur for the random distribution. Use of equimolar C and G at the third base reduces the
over-representation of S, R, and L but does not cure the maldistribution of acidics and basics.
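
The ratios quoted in paragraph [0218] can be verified by counting codons in the standard genetic code. The sketch below counts codons per amino acid for fully equimolar bases and for the case where the third base is restricted to C and G; the function name is an assumption, and "basic" is taken to mean K and R as in the constraint used later in the text.

```python
from collections import Counter

BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    b1 + b2 + b3: AMINO[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}

def codon_counts(third_bases=BASES):
    """Number of codons per amino acid when bases 1 and 2 are equimolar in all
    four nucleotides and base 3 is equimolar in `third_bases`."""
    return Counter(CODON_TABLE[b1 + b2 + b3]
                   for b1 in BASES for b2 in BASES for b3 in third_bases)

nnn = codon_counts()        # all 64 codons
nns = codon_counts("CG")    # third base limited to C and G (32 codons)

basic_to_acidic_nnn = (nnn["K"] + nnn["R"]) / (nnn["D"] + nnn["E"])
basic_to_acidic_nns = (nns["K"] + nns["R"]) / (nns["D"] + nns["E"])
```

With all 64 codons, R, S, and L have six codons each against one for W and M; restricting the third base halves that excess but leaves the basic-to-acidic ratio unchanged.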

[0219] Consider the distribution of amino acids encoded by one codon in a population of vgDNA. Let Abun(x) be
the abundance of DNA sequences coding for amino acid x. For any distribution, there will be a most-favored amino acid
(mfaa) with abundance Abun(mfaa) and a least-favored amino acid (lfaa) with abundance Abun(lfaa). We seek the
nucleotide distribution that allows all twenty amino acids and that yields the largest ratio Abun(lfaa)/Abun(mfaa) subject
to two constraints. First, the abundances of acidic and basic amino acids should be equal. Second, the number of stop
codons should be kept as low as possible. Thus only nucleotide distributions that yield

Abun(E)+Abun(D) = Abun(R)+Abun(K)

are considered, and the function maximized is:

f(distribution) = {(1-Abun(stop)) (Abun(lfaa)/Abun(mfaa))}.

[0220] We limit the third base to equimolar T and G (C and G would be equivalent). All amino acids are possible
and the number of accessible stop codons is reduced.

[0221] A computer program, "Find Optimum vgCodon" (Table 18), varies the composition at bases 1 and 2, in
steps of 0.05, and reports the composition that gives the largest value of f(distribution) subject to the constraints
(a1 denotes the fraction of A at base 1, c2 the fraction of C at base 2, and so on):

g2 = (g1*a2 - 0.5*a1*a2)/(c1 + 0.5*a1),
t1 = 1 - a1 - c1 - g1, and
t2 = 1 - a2 - c2 - g2.

The first constraint requires equal amounts of acidic and basic amino acids and the second and third conserve matter.
We vary a1, c1, g1, a2, and c2 and then calculate t1, g2, and t2. Initially, variation is in steps of 5%. Once an
approximately optimum distribution of nucleotides is determined, the region is further explored with steps of 1%. The optimum
distribution is:
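
The search described in paragraph [0221] can be sketched as a plain grid search. The code below is not the program of Table 18; it is a minimal re-implementation under stated assumptions: the third base is fixed at equimolar T and G, every nucleotide fraction is kept strictly positive, and only the coarse pass is performed (the 1% refinement around the coarse optimum is omitted). All function names are hypothetical.

```python
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    b1 + b2 + b3: AMINO[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}

def abundances(mix1, mix2, mix3):
    """Abun(x) for every amino acid and stop ('*'), given base fractions per position."""
    abun = {}
    for b1, f1 in mix1.items():
        for b2, f2 in mix2.items():
            for b3, f3 in mix3.items():
                aa = CODON_TABLE[b1 + b2 + b3]
                abun[aa] = abun.get(aa, 0.0) + f1 * f2 * f3
    return abun

def f_distribution(abun):
    """(1 - Abun(stop)) * Abun(lfaa) / Abun(mfaa); zero unless all 20 amino acids occur."""
    amino = [v for aa, v in abun.items() if aa != "*"]
    if len(amino) < 20:
        return 0.0
    return (1.0 - abun.get("*", 0.0)) * min(amino) / max(amino)

def find_optimum(step_percent=5):
    """Coarse grid search over a1, c1, g1, a2, c2; t1, g2, t2 follow from the constraints."""
    n = 100 // step_percent
    mix3 = {"T": 0.5, "G": 0.5}
    best_f, best = -1.0, None
    for ia1 in range(1, n):
        for ic1 in range(1, n - ia1):
            for ig1 in range(1, n - ia1 - ic1):
                a1, c1, g1 = ia1 / n, ic1 / n, ig1 / n
                mix1 = {"A": a1, "C": c1, "G": g1, "T": 1.0 - a1 - c1 - g1}
                for ia2 in range(1, n):
                    for ic2 in range(1, n - ia2):
                        a2, c2 = ia2 / n, ic2 / n
                        # Acidic = basic constraint fixes g2; matter conservation fixes t2.
                        g2 = (g1 * a2 - 0.5 * a1 * a2) / (c1 + 0.5 * a1)
                        t2 = 1.0 - a2 - c2 - g2
                        if g2 <= 0.0 or t2 <= 0.0:
                            continue
                        mix2 = {"A": a2, "C": c2, "G": g2, "T": t2}
                        score = f_distribution(abundances(mix1, mix2, mix3))
                        if score > best_f:
                            best_f, best = score, (mix1, mix2, mix3)
    return best_f, best

if __name__ == "__main__":
    # A 10% grid keeps the demonstration fast; the text uses 5% and then 1% steps.
    score, (m1, m2, m3) = find_optimum(step_percent=10)
    print(score, m1, m2)
```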

Optimum vgCodon
[0222]
and yields DNA molecules encoding each type of amino acid with the abundances shown in Table 19.
[0223] The actual nucleotide distribution obtained in synthetic DNA will differ from the specified nucleotide
distribution due to several causes, including: a) differential inherent reactivity of nucleotide substrates, and b) differential
deterioration of reagents. It is possible to compensate partially for these effects, but some residual error will occur. We
denote the average discrepancy between specified and observed nucleotide fraction as Serr,

Serr = square root ( average[ ((fobs - fspec)/fspec)^2 ] ),

where fobs is the amount of one type of nucleotide found at a base and fspec is the amount of that type of nucleotide that
was specified at the same base. The average is over all specified types of nucleotides and over a number (e.g., 10 to
50) of different variegated bases. By hypothesis, the actual nucleotide distribution at a variegated base will be within 5%
of the specified distribution. Actual DNA synthesizers and DNA synthetic chemistry may have different error levels. It is
the user's responsibility to determine Serr for the DNA synthesizer and chemistry employed by the user.
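
A user measuring Serr for a particular synthesizer might compute it as below. This sketch assumes Serr is the root-mean-square of the fractional (relative) discrepancy between observed and specified nucleotide fractions, which is how Serr is applied elsewhere in the text; the function name and the sample numbers are hypothetical.

```python
from math import sqrt

def s_err(measurements):
    """Root-mean-square fractional error over (f_obs, f_spec) pairs, pooled across
    all specified nucleotide types and all variegated bases examined."""
    squared = [((f_obs - f_spec) / f_spec) ** 2 for f_obs, f_spec in measurements]
    return sqrt(sum(squared) / len(squared))

# Hypothetical data: one base specified as equimolar T and G, observed at 0.525 / 0.475.
example = s_err([(0.525, 0.5), (0.475, 0.5)])
```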
[0224] To determine the possible effects of errors in nucleotide composition on the amino acid distribution, we
modified the program "Find Optimum vgCodon" in four ways:

1) the fraction of each nucleotide in the first two bases is allowed to vary from its optimum value times (1 - Serr) to
the optimum value times (1 + Serr) in seven equal steps (Serr is the hypothetical fractional error level), maintaining
the sum of nucleotide fractions for one codon position at 1.0,

2) g2 is varied in the same manner as a2; we dropped the restriction that
Abun(D) + Abun(E) = Abun(K) + Abun(R),

3) t3 and g3 are varied from 0.5 times (1 - Serr) to 0.5 times (1 + Serr) in three equal steps,

4) the smallest ratio Abun(lfaa)/Abun(mfaa) is sought.

In actual experiments, we direct the synthesizer to produce the optimum DNA distribution "Optimum vgCodon" given
above. Incomplete control over DNA chemistry may, however, cause us to actually obtain the following distribution that
is the worst that can be obtained if all nucleotide fractions are within 5% of the amounts specified in "Optimum
vgCodon". A corresponding table can be calculated for any given Serr using the program "Find worst vgCodon within Serr of
given distribution," given in Table 20.

Optimum vgCodon, worst 5% errors

[0225]
[0226] This distribution yields DNA encoding each of the twenty amino acids at the abundances shown in Table 21.
[0227] Each codon synthesized with the distribution of bases shown above displays 4 x 4 x 2 = 2^5 = 32 possible
DNA sequences, though not in equal abundances. An oligonucleotide containing N such codons would display 2^(5N)
possible DNA sequences and would encode 20^N protein sequences. Other variegation schemes produce different numbers
of DNA and protein sequences. For example, if two bases in one codon are varied through two possibilities each, then
there are 2 x 2 = 4 DNA sequences and 2 x 2 = 4 protein sequences.

[0228] If five codons are synthesized with reagents mixed so as to produce the nucleotide distribution "Optimum
vgCodon", and if we actually obtained the nucleotide distribution "Optimum vgCodon, worst 5% errors", then DNA
sequences encoding the mfaa at all of the five codons are about 277 times as likely as DNA sequences encoding the
lfaa at all of the five codons. Further, about 24% of the DNA sequences will have a stop codon in one or more of the five
codons.
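
Both figures in paragraph [0228] are per-codon quantities compounded over five independent codons. The sketch below shows the compounding and, as an illustration only, back-calculates the per-codon values implied by the stated 277-fold ratio and 24% stop frequency; those back-calculated numbers are inferences from the arithmetic, not entries read from Table 21.

```python
def ratio_over_codons(per_codon_ratio, n_codons):
    """Abundance ratio of all-mfaa to all-lfaa sequences over n independent codons."""
    return per_codon_ratio ** n_codons

def fraction_with_stop(p_stop, n_codons):
    """Fraction of sequences with a stop codon in at least one of n codons."""
    return 1.0 - (1.0 - p_stop) ** n_codons

# Per-codon values implied by the figures quoted in the text (five codons).
implied_ratio = 277 ** (1 / 5)               # mfaa:lfaa per codon, roughly 3.1
implied_p_stop = 1.0 - (1.0 - 0.24) ** (1 / 5)   # stop fraction per codon, roughly 0.05
```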

[0229] Consider variegation of a hypothetical sequence, F24-G25-D26-E27-T28, in which each variegated codon
is synthesized as an "Optimum vgCodon". The actual abundance of the DNA encoding each type of amino acid is,
however, taken from the case of Serr = 5% given in Table 21. The abundance of DNA encoding the parental amino acid
sequence is:

Amount(parental seq.)
   (F24 G25 D26 E27 T28)
= Abun(F) x Abun(G) x Abun(D) x Abun(E) x Abun(T)
= .0249 x .0663 x .0545 x .0602 x .0437
= 2.4 x 10^-7



Therefore, if the efficiency of the entire process allows us to examine 10^ different DNA sequences, DNA encoding the
parental DBP sequence as well as very many related sequences will be present in sufficient quantity to be detected and
we are assured that the process will be progressive.
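
The parental-abundance calculation above is a product of per-residue abundances and can be scripted for any parental sequence. In this sketch the five abundances are the values quoted in the worked example; the function names are hypothetical, and the library size passed to `expected_parental_copies` is left to the caller because it depends on the efficiency of the user's own process.

```python
from math import prod

# Abundances quoted in the worked example (Serr = 5% case).
ABUN = {"F": 0.0249, "G": 0.0663, "D": 0.0545, "E": 0.0602, "T": 0.0437}

def parental_abundance(parental_residues, abun):
    """Fraction of vgDNA molecules that encode the parental amino acid
    at every variegated residue."""
    return prod(abun[aa] for aa in parental_residues)

def expected_parental_copies(parental_residues, abun, n_examined):
    """Expected number of parental-sequence molecules among n_examined sequences."""
    return parental_abundance(parental_residues, abun) * n_examined

p_parent = parental_abundance("FGDET", ABUN)
```

A variegation level is practical in the sense used here when `expected_parental_copies` is comfortably above one for the number of sequences the process can examine.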

Setting level of variegation: 

[0230] We use the following procedure to determine whether a given level of variegation is practical: 

1) from: a) the intended nucleotide distribution at each base of a variegated codon, and b) Serr (the error level in
mixed DNA synthesis), calculate the abundances of DNA sequences coding for each amino acid and stop,

2) calculate the abundance of DNA encoding the parental DBP sequence by multiplying the abundances of the
parental amino acid at each variegated residue.

The abundances used in the procedure above are calculated from the worst distribution that is within Serr of the
specified distribution. A variegation that ensures that the parental DBP sequence can be recovered is practical. Such a level
of variegation produces an enormous number of multiple changes related to the parental DBP available for selection of
improved successful DBPs. We adjust the subset of residues to be varied and levels of variegation at each residue until
the calculated variegation is within bounds.

Reduction of gratuitous restriction sites:

[0231] If the method of mutagenesis to be used is replacement of a cassette, we consider whether the variegation
generates gratuitous restriction sites. We reduce or eliminate gratuitous restriction sites by appropriate choice of
variegation pattern and silent alteration of codons neighboring the sites of variegation.
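
One way to screen a variegated cassette for gratuitous sites is to write it with IUPAC ambiguity codes and test whether any member of the population could contain a given recognition sequence. The sketch below is illustrative: the cassette sequences are invented, EcoRI (GAATTC) is used only as a familiar palindromic example, and only the given strand is scanned, so a non-palindromic site would also need its reverse complement checked.

```python
# IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def possible_sites(vg_seq, site):
    """Positions where at least one variant of the variegated sequence
    matches the recognition site exactly."""
    hits = []
    for i in range(len(vg_seq) - len(site) + 1):
        window = vg_seq[i:i + len(site)]
        if all(base in IUPAC[code] for code, base in zip(window, site)):
            hits.append(i)
    return hits

ECORI = "GAATTC"
risky = possible_sites("CTGANNKCAG", ECORI)     # hypothetical cassette: NNK next to GA...C
repaired = possible_sites("CTGTNNKCAG", ECORI)  # a neighboring base changed
```

In practice the neighboring change would be chosen to be silent at the protein level, as the paragraph above requires.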

Focused mutagenesis:

[0232] In the preferred embodiment of this process, the number of residues and the range of variation at each
residue are chosen to maximize the number of DNA binding surfaces, to minimize gratuitous restriction sites, and to assure
the recovery of the initial DBP sequence. For example, in Detailed Example 1, the initial DBP is λ Cro. One primary set
of residues includes G15, Q16, K21, Y26, Q27, S28, N31, K32, H35, A36, and R38 of the H-T-H region (Table 14b) and
C-terminal residues K56, N61, K62, K63, T64, T65, and A66. A secondary set of residues includes L23, G24, and V25
from the turn portion of the H-T-H region, buried residues T20, A21, A30, I31, A34, and I35 from alpha helices 2 and 3,
and dimerization region residues E54, V55, F58, P59, and S60.

[0233] The initial set of 5 residues for Focused Mutagenesis contains residues in or near the N-terminal half of
alpha helix 3: Y26, Q27, S28, N31, and K32. Varying these 5 residues through all 20 amino acids produces 3.2 x 10^6
different protein sequences encoded by 32^5 (= 3.3 x 10^7) different DNA sequences. Since all 5 residues are in the same
interaction set, this variegation scheme produces the maximum number of different surfaces. Assuming optimized
nucleotide distribution described above and Serr = 5%, the probability of obtaining the parental sequence is 3.2 x 10^-7.
This level is within bounds for synthesis, ligation, transformation, and selection capable of examining 10^8 sequences of
vgDNA. Codons for the 5 residues picked for Focused Mutagenesis are contained in the 51 bp PpuM II to BglII fragment
of the rav+ gene constructed in Detailed Example 1.

Repetition to obtain desired degree of DNA-binding:

[0234] The first variegation step can produce one or more DBPs having DNA-binding properties that are
satisfactory to the user. If the best selected DBP is not fully satisfactory, parental DBPs for a second variegation step are picked
from DBPs isolated in the first variegation step. The second and subsequent variegation steps may employ either
Focused or Diffuse Mutagenesis procedures on residues of the primary or secondary sets. In the preferred embodiment
of this process, the user chooses residues and mutagenesis procedures based on the structure of the parental DBP
and specific goals. For example, consider three hypothetical cases.
[0235] In a first case, a variegation step produces a DBP with greater non-specific DNA binding than is desired.
Information from sequence analysis and modeling is used to identify residues involved in sequence-independent
interactions of the DBP with DNA in the non-specific complex. In the next variegation step, some or all of these residues,
together with one or more additional residues from the primary set, are chosen for Focused Mutagenesis and additional
residues from the primary or secondary sets are chosen for Diffuse Mutagenesis.
[0236] In a second hypothetical case, a variegation step produces a DBP with strong sequence-specific binding to
the target and the goal is to optimize binding. In this case, the next variegation step employs Diffuse Mutagenesis of a
large number of residues chosen mostly from the secondary set.

[0237] In the third hypothetical case, a DBP has been isolated that has insufficient binding properties. A set of
residues is chosen to include some primary residues that have not been subjected to variation, one or more primary
residues that have been varied previously, and one or more secondary residues. Focused Mutagenesis is performed on this
set in the next variegation step.

Overview: DNA Synthesis, Purification, and Cloning

DNA sequence design:

[0238] The present invention is not limited to a single method of gene design. The idbp gene need not be
synthesized in toto; parts of the gene may be obtained from nature. One may use any genetic engineering method to produce
the correct gene fusion, so long as one can easily and accurately direct mutations to specific sites. In all of the methods
of mutagenesis considered in the present invention, however, it is necessary that the DNA sequence for the idbp gene
be unique compared to other DNA in the operative cloning vector. If the method of mutagenesis is to be replacement of
subsequences coding for the potential-DBP with vgDNA, then the subsequences to be mutagenized must be bounded
by restriction sites that are unique with respect to the rest of the vector. If single-stranded oligonucleotide-directed
mutagenesis is to be used, then the DNA sequence of the subsequence coding for the initial DBP must be unique with
respect to the rest of the vector.

[0239] The coding portions of genes to be synthesized are designed at the protein level and then encoded in DNA.
The amino acid sequences are chosen to achieve various goals, including: a) expression of initial DBP intracellularly,
and b) generation of a population of potential-DBPs from which to select a successful DBP. The ambiguity in the genetic
code is exploited to allow optimal placement of restriction sites and to create various distributions of amino acids at
variegated codons.

Organization of gene synthesis:

[0240] The present invention is not limited as to how a designed DNA sequence is divided for easy synthesis. An
established method is to synthesize both strands of the entire gene in overlapping segments of 20 to 50 nucleotides
(THER88). An alternative method that is more suitable for synthesis of vgDNA is similar to methods published by others
(OLIP86, OLIP87, AUSU87, KARN84). Contrary to most previous workers, we: a) use two synthetic strands, and b) do
not cut the extended DNA in the middle. Our goals are: a) to produce longer pieces of dsDNA than can be synthesized
as ssDNA on commercial DNA synthesizers, and b) to produce strands complementary to single-stranded vgDNA. By
using two synthetic strands, we remove the requirement for a palindromic sequence at the 3' end. Moreover, the overlap
should not be palindromic lest single DNA molecules prime themselves.
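
The requirement that the overlap not be palindromic can be checked mechanically: a DNA sequence is palindromic in this sense when it equals its own reverse complement. The following sketch uses invented overlap sequences and hypothetical function names.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Reverse complement of a DNA sequence written 5' to 3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]

def is_self_complementary(overlap):
    """True if the overlap equals its own reverse complement, in which case
    two copies of the same strand could anneal and prime each other."""
    return overlap.upper() == reverse_complement(overlap)

bad_overlap = is_self_complementary("GAATTC")     # palindromic: avoid
good_overlap = is_self_complementary("GATTACA")   # not palindromic
```

A stricter screen might also reject overlaps whose 3' ends are only partly self-complementary; that refinement is not shown here.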

[0241] The present invention is not limited to any particular method of DNA synthesis or construction. Preferably,
DNA is synthesized on a Milligen 7500 DNA synthesizer (Milligen, Bedford, MA) by standard procedures. Synthetic
DNA is purified by polyacrylamide gel electrophoresis (PAGE) or high-pressure liquid chromatography (HPLC). The
present invention is not limited to any particular method of purifying DNA for genetic engineering.

IDBP Gene cloning: 

[0242] We clone the idbp gene using plasmids that are transformed into competent bacterial cells by standard
methods (MANI82) or slightly modified standard methods. DNA fragments derived from nature are operably linked to
other fragments of DNA.

[0243] Cells transformed with the plasmid bearing the complete idbp gene are tested to verify expression of the
initial DBP. Selection for plasmid presence is maintained on all media, while selections for DBP+ phenotypes are applied
only after growth in the presence of inducer appropriate to the promoter. Colonies that display the DBP+ phenotypes in
the presence of inducer and DBP- phenotypes in the absence of inducer are retained for further genetic and biochemical
characterization. The presence of the idbp gene is initially detected by restriction enzyme digestion patterns
characteristic of that gene and is confirmed by sequencing.

[0244] The dependence of the IDBP+ and IDBP- phenotypes on the presence of this gene is demonstrated by
additional genetic constructions. These are a) excision of the idbp gene by restriction digestion and closure by ligation, and
b) ligation of the excised idbp gene into a plasmid recipient carrying different markers and no dbp gene. Plasmids
obtained by excising the gene confer the DBP- phenotypes (e.g., Tc^, Fus^, and Gal^ in Detailed Example 1). Plasmids
obtained from ligation of idbp into a recipient plasmid confer the DBP+ phenotypes in the presence of an inducer
appropriate to the regulatable promoter (e.g., Tc^, Fus", and Gal" in Detailed Example 1). Finally, a most important
demonstration of the successful construction involves determination of the quantitative dependence of the selected
phenotypes on the exogenous inducer concentration.

Overview: DNA-binding Protein Purification and Characterization

Isolation of IDBP: 

[0245] We purify IDBP and its derivatives by standard methods, such as those described in JOHN80, TAKE86,
LEIG87, VERS85b, and KADO86.

Quantitation and characterization of protein-DNA binding:

[0246] Methods that can be used to quantitate and characterize sequence-specific and sequence-independent
binding of a DBP to DNA include: a) filter-binding assays, b) electrophoretic mobility shift analysis, and c) DNase
protection experiments. Ionic strength, pH, and temperature are important factors influencing DBP binding to DNA.
Standard conditions should correspond closely to the anticipated conditions of use. Thus, if a binding protein is intended for
use in bacterial cells in standard culture, a reasonable range of values from which to choose standard conditions would
be: pH = 7.5 to 8.0, 0.1 to 0.2 M KCl, and 32° to 37°C. Assay buffers preferably include cofactors, stabilizing agents, and
counter ions for proper DBP function.

[0247] We prepare DNA fragments for analysis of protein-DNA binding by methods that are very similar to those
described in MAXA77, KLEN70, RIGB77, and KIMJ87. Filter-binding assays can yield thermodynamic (KD) and kinetic
(ka and kd) constants and are performed by methods similar to those described by RIGG70 and KIMJ87.
Electrophoretic mobility shift measurements can also yield values of KD, ka, and kd and are performed by methods similar to
those of FRIE81. DNase protection assays use the methods of JOHN79, MAXA77, and FOXK88. We use chemical methods
to characterize binding of proteins to DNA similar to the methods described in BRUN87, BUSH85, and JENJ86.



Table of Examples

Ex. 1   Protocol for developing a new DNA-binding protein with affinity for a DNA-sequence found in HIV-1, by
        variegation of λ Cro.

Ex. 2   Protocol for developing a new DNA-binding polypeptide with affinity for a DNA-sequence found in HIV-1,
        by variegation of a polypeptide having a segment homologous with Phage P22 Arc.

Ex. 3   Use of a custodial domain (residues 20-83 of barley chymotrypsin inhibitor) to protect a DNA-binding
        polypeptide from degradation.

Ex. 4   Use of a custodial domain containing a DNA-recognizing element (alpha-3 helix of Cro) to protect a
        DNA-binding polypeptide from degradation.

Ex. 5   Protocol for addition of arm to Phage P22 Arc to alter its DNA-binding characteristics.

Ex. 6   Protocol for preparation of novel DNA-binding protein that recognizes an asymmetric DNA sequence
        and corresponds to a fusion of third zinc-finger domain of the Drosophila kr gene product and the
        DNA-binding domain of Phage P22 Arc.



DETAILED EXAMPLE 1 

[0248] Below is a hypothetical example of a protocol for developing a new DNA-binding protein derived from λ Cro
with affinity for a DNA sequence found in human immunodeficiency virus type 1 (HIV-1) using E. coli K-12 as the cell 
line or strain. Further optimization, in accordance with the teachings herein, may be necessary to obtain the desired 
results. Possible modifications in the preferred method are discussed following various steps of the example. 
[0249] By hypothesis, we set the following technical capabilities: 



Yield from DNA synthesis               500 ng/synthesis of ssDNA 100 bases long,
                                       10 ug/synthesis of ssDNA 60 bases long,
                                       1 mg/synthesis of ssDNA 20 bases long.

Maximum oligonucleotide                100 bases

Yield of plasmid DNA                   1 mg/l of culture medium

Efficiency of DNA ligation             0.1% for blunt-blunt,
                                       4% for sticky-blunt,
                                       11% for sticky-sticky.

Yield of transformants                 5 x 10^8/ug DNA

Error in mixed DNA synthesis (Serr)

Choice of cell line or strain: 

[0250] In this example, the following E. coli K-12 recA strains are used: ATCC #35,882 delta4 (Genotype: W3110
trpC, recA, rpsL, sup°, delta4 (gal-chlD-pgl-att λ)) and ATCC #33,694 HB101 (Genotype: F-, leuB, proA, recA, thi,
ara, lacY, galK, xyl, mtl, rpsL, supE, hsdS, (rB-, mB-)). E. coli K-12 strains are grown at 37°C in LB broth (MANI82, p440)
and on LB agar (addition of 15 g Bacto-agar) for routine purposes. Selections for plasmid uptake and maintenance are
performed with addition of ampicillin (Ap) (200 ug/ml), tetracycline (Tc) (12.5 ug/ml) and kanamycin (Km) (50 ug/ml).

Choice of initial DBP: 

[0251] The initial DBP is λ Cro. Helix-turn-helix proteins are preferred over other known DBPs because more detail
is known about the interactions of these proteins with DNA than is known for other classes of natural DBP. λ Cro is
preferred over λ repressor because it has lower molecular weight. Cro from 434 is smaller than λ Cro, but more is known
about the genetics and 3D structure of λ Cro. An X-ray structure of the λ Cro protein has been published, but no X-ray
structure of a DNA-Cro complex has appeared. A mutant of λ Cro, Cro67, confers the positive control phenotype in vitro
but not in vivo. The contacts that stabilize the Cro dimer are known, and several mutations in the dimerization function
have been identified (PAKU86).

[0252] By the methods disclosed herein, DBPs may be developed from Cro which recognize DNA binding sites different from the λ Or3 or λ operator consensus binding sites, including heterodimeric DBPs which recognize non-symmetric DNA binding sites.

Selections for phenotypes conferred by DBP+ function:

[0253] Media generally are supplemented with IPTG and antibiotic for selection of plasmid maintenance. Cell background is generally strain delta4 (galK,T,E deletion).

a. Galactose resistance (GalR). Galactose epimerase deficient (galE-) strains of E. coli (BUTT63) lyse when treated with galactose. Selective medium is supplemented with 2% galactose, added after autoclaving. Additional galactose, up to 8%, somewhat reduces the background of artifactual galactose-sensitive colonies.

b. Galactose resistance selected immediately after transformation. Inducer IPTG is added to transformed cells, to 5 x 10^-4 M, at the start of the growth period that allows expression of plasmid antibiotic-resistance. At 60 min after heat shock, cells are further diluted 10-fold into fresh LB broth containing IPTG, antibiotic to select for plasmid uptake (e.g., Ap or Km), and 2% galactose. Cells are grown until lysis is complete or for 3 h, whichever occurs first, then centrifuged at 6,000 rpm for 10 min, resuspended in the initial volume of the post-transformation growth culture, and applied to medium for further selection.

c. Fusaric acid resistance (TcS, FusR). Successful repression of tet yields resistance to lipophilic chelating agents such as fusaric acid (FusR phenotype). Medium described by MALO81 is used for selection of fusaric acid resistance in E. coli; the amount of fusaric acid may be varied. Total cell inoculum is not greater than 5x10® per plate.

d. Fusaric acid resistance and galactose resistance. Galactose at a final concentration of 2% is added to the medium described by MALO81 after autoclaving. Cells selected directly for galactose resistance in liquid following transformation are applied to this medium.

Selections for phenotypes conferred by DBP- function:

[0254] Cell background is generally strain HB101 (galK-). Media are generally supplemented with IPTG and antibiotic for plasmid maintenance.

a. Tc resistance. Medium, usually LB agar, is supplemented with Tc after autoclaving. Tc stock solution is 12.5 mg/ml in ethanol. It is stored at -20°C, wrapped in aluminum foil. Petri plates containing Tc are also wrapped in foil. Minimum inhibitory concentration is 3.1 ug/ml using a cell inoculum of 5 x 10^ to 10^ per plate. More stringent selections employ up to 50 ug/ml Tc. When used for selection of plasmid maintenance, Tc concentration is 12.5 ug/ml.

b. Galactose utilization. Minimal A Medium (MILL72, p432), with galactose as carbon source: after autoclaving add (per liter) 1 ml of 1 M MgSO4, 0.5 ml of 10 mg/ml thiamine HCl, 10 ml of 20% galactose, and amino acids as required. Cell inoculum per plate is less than 5x10^.

c. Tc resistance and galactose utilization. Medium A with galactose (section b, above) is supplemented with Tc at 3.1 ug/ml.

Selectable systems for DBP isolation:

[0255] The tet gene from pBR322 and the E. coli galT,K genes are used in a gal deletion host strain for selection of DBP function. pKK175-6 (BROS84; Pharmacia, Piscataway, NJ), a pBR322 derivative, contains the replication origin, bla (confers ApR) for selection of plasmid maintenance, and tet, one of the two selectable genes (Figure 3). In pKK175-6, tet is promoterless, and all DNA upstream of the pBR322 tet coding region that potentially allows transcription in both directions (BROS82) has been deleted and replaced by the M13 mp8 polylinker. The polylinker and tet are flanked by strong transcription terminators from E. coli rrnB. tet is placed under control of the Tn5 neo promoter, Pneo.

[0256] Plasmid pAA3H (Figure 4) (ATCC #37,308) (AHME84) provides the second set of selectable genes, galT,K. In gal deleted hosts (such as strain ATCC #35,882 carrying the delta4 deletion (E. coli delta4)), plasmid pAA3H confers the ApR TcS GalS phenotype (AHME84) because part of galE is deleted. The galT and galK genes in pAA3H are transcribed from the P1 "antitet" promoter (BROS82). In E. coli strains carrying galT or galK mutations (e.g., strain HB101), pAA3H confers Gal+. We place galT+ and galK+ under control of the pBR322 amp gene promoter.
[0257] For both tet and gal systems, positive selections are used to select cells that either express or do not express these genes from cultures containing a vast excess of cells of the opposite phenotype.

Placement of test DNA binding sequence: 

[0258] The test DNA binding sequence for the IDBP, λ Or3 (KIMJ87), is placed so that the first 5' base is the +1 base of the mRNA transcribed in each of the tet and gal transcription units (Table 100 and Table 101).

Engineering the idbp gene:

[0259] A DNA sequence encoding the wild-type Cro protein is designed such that expression is controlled by the lacUV5 promoter. The DNA sequence departs from the wild-type cro gene sequence by the introduction of restriction sites. Thus, the gene is called rav. The transcriptional unit comprising PlacUV5, rav, and trpA terminator is shown in Table 102.

Vector construction: 

[0260] The construction of an operative cloning vector is summarized in Figure 5. The gal region of pAA3H requires manipulation before insertion into pKK175-6. First, the λ-derived DNA between HpaI and EcoRI is replaced with a ClaI linker (New England BioLabs, #1037). Standard methods are used and the resulting plasmid is named pEP1001 (Figure 6). All plasmids cited in the present application are catalogued in Table 103.

[0261] Next, we insert a synthetic fragment, shown in Table 104, comprising the phage fd terminator and two restriction sites (SpeI and SfiI) into the ClaI site of pEP1001; the resulting plasmid is named pEP1002 (Figure 7). Next, we replace the P1 promoter upstream of gal with Pamp from pBR322. As shown in Table 100, λ Or3 is positioned downstream of Pamp so that it can be used to determine whether binding of Cro can prevent transcription of galT,K. Restriction sites are provided to allow later alteration of the target sequence. The synthetic fragment is cloned into pEP1002 between DraIII and BamHI. The resulting plasmid is named pEP1003 and confers GalS on delta4 cells.



[0262] The gal genes with the promoter and the fd terminator are moved from pEP1003 into pKK175-6. The 2.69 kb galT,K-bearing HpaI fragment of pEP1003 is ligated to DNA obtained from pKK175-6 by partial DraI digestion. Gal+ colonies of transformed HB101 cells are picked. The resulting plasmid is named pEP1004 (Figure 9).
[0263] The Tn5 neo gene promoter and Or3 are synthesized (Table 101) and inserted upstream of the tet coding region of pEP1004 between the unique and SmaI sites. Plasmid DNA from ApR TcR GalS colonies of transformed delta4 cells is analyzed for an insert in the EcoRI-EcoRV fragment of pEP1004. The resulting 7.1 kb plasmid, with two separate selectable gene systems under control of two different promoters and the test DNA binding sequence, is designated pEP1005 (Figure 10).

Cloning the idbp gene:

[0264] The BamHI site in the tet gene of pEP1005 is removed by site-directed mutagenesis; the sequence TGG-ATC-CTC that codes for W97-I98-L99 is changed to TGG-ATA-TTG. DNA from pEP1005 is linearized with EcoRV and part (e.g., 10%) of the DNA is made single stranded with exonuclease III. The mutagenic oligonucleotide shown in Table 105 is annealed to the DNA that is then completed with Klenow enzyme and ligated. Plasmid DNA from TcR, Gal+ colonies of transformed HB101 is analyzed by standard means; the resulting plasmid is named pEP1006.
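The codon change described for removal of the BamHI site can be checked with a short sketch. The two nine-base strings are those given in the text; the small codon table, the helper name `translate`, and the variable names are illustrative assumptions and are not part of the disclosed method.

```python
# Minimal check of the silent mutation that removes the BamHI site in tet.
# Only the five codons needed here are listed (standard genetic code).
CODON_TABLE = {"TGG": "W", "ATC": "I", "ATA": "I", "CTC": "L", "TTG": "L"}
BAMHI_SITE = "GGATCC"  # BamHI recognition sequence


def translate(dna):
    """Translate an in-frame DNA string with the small table above."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))


original = "TGGATCCTC"  # codons for W97-I98-L99 before mutagenesis
mutated = "TGGATATTG"   # codons after mutagenesis

same_protein = translate(original) == translate(mutated)
site_before = BAMHI_SITE in original
site_after = BAMHI_SITE in mutated
```

Under these assumptions the encoded tripeptide is unchanged while the recognition sequence is lost, which is the property the mutagenic oligonucleotide is meant to provide.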
[0265] Synthetic DNA containing a SpeI overhang, followed by sequences for the lacUV5 promoter, a ribosome binding site, cloning sites for idbp, the trpA terminator (ROSE79), and an SfiI restricted end complementary to the SfiI site in pEP1005, is synthesized as six oligonucleotides as shown in Table 107. We use the methods of THER88 to anneal and ligate these fragments into SpeI, SfiI cut pEP1006. Plasmid DNA from ApR, TcR, GalS colonies of transformed delta4 cells is examined for the SpeI-SfiI insertion by restriction with SpeI, BstEII, BglII, KpnI, and SfiI. The inserted DNA is verified by DNA sequencing, and the 7.22 kb plasmid containing the proper insertion is designated pEP1007, shown in Figure 11.

[0266] The idbp gene sequence specifying the Cro+ protein, designated rav in this Example, is inserted in two cloning steps. The BstEII-BglII segment of rav (Table 109) is inserted first. Oligonucleotides olig#14 and olig#15 are synthesized, annealed, and filled in with Klenow enzyme (cf. KARN84). The dsDNA is cut with BstEII and BglII and ligated to BstEII-BglII cut pEP1007. The plasmid containing the appropriate partial rav sequence is designated pEP1008.

[0267] The BglII-KpnI fragment of rav is synthesized and inserted in the same manner as the BstEII-BglII fragment. (See Table 110.) This plasmid carrying the complete rav gene is designated pEP1009, shown in Figure 12.

Determine whether IDBP is expressed: 

[0268] To determine whether cells carrying pEP1009 display the phenotypes expected for rav expression, the delta4 strain bearing pEP1009 is tested on various Ap containing selective media with and without IPTG. Cells are streaked on LB agar media containing: a) Tc; b) fusaric acid; or c) galactose (vide supra). Control strains are the delta4 host with no plasmid, and with pEP1005, pBR322, or pAA3H.

[0269] The results below indicate that the rav gene is expressed and the gene product is functional, and that expression is regulated by the lacUV5 promoter.



Growth of derivatives of strain delta4 on selective media (+ Ap)

[0271] λ cI- phage is streaked on each of the above strains, on LB agar with Ap, and with and without IPTG. At sufficiently high intracellular levels of Cro protein, binding of the Cro repressor protein to the λ phage operators Or and Ol prevents phage growth. Data indicating correct expression and function of the rav gene are:

Growth of λ cI- on delta4 cells

[0272] (Table: phage growth on delta4 cells bearing pEP1009 or pEP1005, scored with IPTG and without IPTG.)



[0273] These procedures indicate that the chosen IDBP, the product of the rav gene, is expressed and is success- 
fully repressing both the test operators on the plasmid and the wild type operators on the challenge phage. 


DBP purification: 

[0274] Proteins are purified as described by Leighton and Lu (LEIG87).

Quantitation of DBP binding:

[0275] We measure DBP binding to the target operator DNA sequence with a filter binding assay, initially using filter binding assay conditions similar to those described for λ Cro (KIMJ87). Data are analyzed by the methods of RIGG70 and KIMJ87.

[0276] The target DNA for the assay is the 113 bp ApaI-RsaI fragment from plasmid pEP1009 containing λ Or3. A control DNA fragment of the same size, used to determine non-specific DNA binding, contains a synthetic ApaI-XbaI DNA fragment specifying the amp promoter and the sequence

5' CTTATACACGAAGCGTGACAA 3'. This sequence preserves the base content of the Or3 sequence but lacks several sites of conserved sequence required for λ Cro binding (KIMJ87) and is cloned between the ApaI-XbaI sites of the pEP1009 backbone to yield pEP1010.

Media Formulations: 

[0277] Gal^ is demonstrable in LB agar and broth at very low concentrations (0.2% galactose), and is optimal at 2 
to 8% galactose. Galactose and Tc selections are performed in LB medium. FusR is best achieved in the medium described by Maloy and Nunn (MALO81) for E. coli K-12 strains.

Induction of DBP expression: 

[0278] The pdbp gene is regulated by the lacUV5 promoter. Optimal induction is achieved by addition of IPTG at 5 x 10^-4 M (MAUR80). Experimentation for each successful DBP determines the lowest concentration that is sufficient to maintain repression of the selection system genes.

Optimization of selections 

[0279] For each selective medium used to detect IDBP function, factors are varied to obtain a maximal number of transformants per plate with a minimal number of false positive artifactual colonies. Of greatest importance in this optimization is the transcriptional regulation of the initial potential-DBP, such that in further mutagenesis studies, de novo binding at an intermediate affinity is compensated by high level production of DBP.

Regulation of IDBP: 

[0280] Cells carrying pEPl009 are grown in LB broth with IPTG at 10'^. 5 x 10'^, 10'^ 5 x lO"^. IQ-"^ and 5 x lO'"^ 
M. Samples are plated on LB agar and on LB agar containing fusaric acid or galactose as described above. All media contain 200 ug/ml Ap, and the IPTG concentration of each broth culture medium is maintained in the respective selective agar medium.

[0281] The IPTG concentration at which 50% of the cells survive is a measure of affinity between IDBP and test 
operator, such that the lower the concentration, the greater the affinity. A requirement for low IPTG, ag, 1 0'^ M. for 50% 
survival due to Rav protein function suggests that use of a high level, 0^5 x ^0^^ M IPTG. employed in selective media 
to isolate mutants displaying de novo binding of a DBP to target DNA, will enable isolation of successful DBPs even if the affinity is low.

Concentration of selective agents and cell inoculum size:

[0282] Fusaric acid and galactose content of each medium is varied, to allow the largest possible cell sample to be applied per Petri plate. This objective is obtained by applying samples of large numbers of sensitive cells (e.g., 5x10, 10®, 5 x 10^) to plates with elevated fusaric acid or galactose. Resistant cells are then used to determine the efficiency of plating. An acceptable efficiency is 80% viability for the resistant control strain bearing pEPIOOS in a delta4 background. The total cell inoculum size is increased as is the level of inhibitory compound until viability is reduced to less than 80%.

Choice and cloning of target sequences: 

[0283] Sequences of the human immunodeficiency virus type 1 (HIV-1) genome were searched for potential target sequences. The known sequences of isolates of HIV-1 were obtained from the GENBANK version 52.0 DNA sequence data base. First we found non-variable regions of HIV-1. We examined the HIV-1 genome from the TATA sequence in the 5' LTR of the HIV-1 genome to the end of the sequence coding for the tat and trs second exons. We intended to locate non-variable regions where a DBP can interfere with the production of tat and/or trs mRNA because the products of these genes are essential in production of virus (DAYT86, FEIN86).

[0284] HIV-1 isolate HXB2 (RATN85) from nucleotide number 1 through 6100 is the reference to which we aligned all other HIV-1 isolates using the Nucleic Acid Database Search program (derived from FASTN (LIPM85)) in the IBI/Pustell Sequence Analysis Programs software package (International Biotechnologies, Inc., New Haven, CT). All stretches of at least 20 bases which have no variation in sequence among all HIV-1 isolates were retained as targets.
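The retention rule stated above (stretches of at least 20 bases with no variation among aligned isolates) may be sketched as follows. This is an illustrative stand-in for the alignment software named in the text, not a description of it; the function name `invariant_segments`, the use of "-" as a gap character, and the toy sequences are assumptions.

```python
def invariant_segments(aligned, min_len=20):
    """Return 1-based inclusive (start, end) ranges in which every aligned
    sequence carries the same base and no sequence has a gap ('-')."""
    length = min(len(s) for s in aligned)
    segments = []
    run_start = None
    for i in range(length):
        column = {s[i] for s in aligned}
        conserved = len(column) == 1 and "-" not in column
        if conserved and run_start is None:
            run_start = i                      # a new invariant run begins
        elif not conserved and run_start is not None:
            if i - run_start >= min_len:       # keep only long enough runs
                segments.append((run_start + 1, i))
            run_start = None
    if run_start is not None and length - run_start >= min_len:
        segments.append((run_start + 1, length))
    return segments


# Toy alignment (hypothetical sequences, shortened threshold for display).
toy = ["ACGTACGTAC", "ACGTTCGTAC"]
toy_segments = invariant_segments(toy, min_len=4)
```

With the pre-aligned isolates supplied as equal-length strings in reference coordinates, the default threshold of 20 reproduces the criterion used to retain targets.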
[0285] From the alignment, segments of the HIV-1 isolate HXB2 sequence that are non-variable among all HIV-1 sequences searched are:

[0286] In the present Example, these potential regions were searched for subsequences matching the central seven base pairs of the λ operators that have high affinity for λ Cro (viz. Or3, the symmetric consensus, and the Kim et al. consensus (KIMJ87)). The consensus sequence of Kim et al. has higher affinity for Cro than does Or3, which is the natural λ operator having highest affinity for Cro. Cro is thought to recognize seventeen base pairs, with side groups on alpha 3 directly contacting the outer four or five bases on each end of the operator. Because the composition and sequence of the inner seven base pairs affect the position and flexibility of the outer five base pairs to either side, these bases affect the affinity of Cro for the operator.

[0287] The sequences sought are shown in Table 111. The letters "A" and "S" stand for antisense and sense. "Or3A/Symm. Consensus.5" is a composite that has Or3A at all locations except 5, where it has the symmetric consensus base, C. Similarly, "Or3A/Symm. Consensus.6" has the symmetric consensus base at location 6 and Or3A at other locations.

[0288] A FORTRAN program searched the non-variable HIV-1 subsequence segments for stretches of seven nucleotides of which at least five are G or C and which are flanked on either side by five bases of non-variable HIV-1 subsequence. The 427 candidate seven-base-pair subsequences obtained using these constraints on CG content were then searched for matches to either the sense or anti-sense strand sequences of the five seven-base-pair subsequences listed above. None of the HIV-1 subsequences is identical to any of the seven-base-pair subsequences. Three HIV-1 subsequences, shown in Table 112, were found that match six of seven bases. Eight subsequences, shown in Table 113, were found that match five out of seven bases and that have five or more GC base pairs. These HIV-1 subsequences are less preferred than the HIV-1 subsequences that match six out of seven bases.
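The search just described can be outlined in a few lines. The sketch below is written in Python rather than FORTRAN and is not the program used; the function names and the example strings are illustrative assumptions and are not entries copied from Tables 111 to 113.

```python
def gc_count(seq):
    """Number of G or C bases in seq."""
    return sum(base in "GC" for base in seq)


def revcomp(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def candidate_cores(segment, core_len=7, flank=5, min_gc=5):
    """List (offset, core) for every core with at least min_gc G/C bases
    that has `flank` bases of the same segment on each side."""
    found = []
    for start in range(flank, len(segment) - core_len - flank + 1):
        core = segment[start:start + core_len]
        if gc_count(core) >= min_gc:
            found.append((start, core))
    return found


def best_match(core, references):
    """Highest number of identical positions against any reference core,
    comparing both the sense and the antisense strand."""
    best = 0
    for ref in references:
        for probe in (ref, revcomp(ref)):
            best = max(best, sum(a == b for a, b in zip(core, probe)))
    return best


# Illustrative input only: a 17-base segment and one assumed reference core.
example_segment = "ACTTTCCGCTGGGGACT"
example_refs = ["CCGCCGG"]
cores = candidate_cores(example_segment)
scores = [(core, best_match(core, example_refs)) for _, core in cores]
```

Candidates scoring six of seven would correspond to the more preferred class described above, and those scoring five of seven to the less preferred class.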






EP0 452 413B1 



         11111111
12345678901234567
aCtTTccGCTggGGaCt    Bases 353-369
actttccGCTggaaagt    Left symmetrized
agtccccGCTggggact    Right symmetrized
tatcAcCGCAAgGgata    Or3

(Lower case letters are palindromic in the two halves of the targets and Or3; highly conserved bases are bold and marked thus a.) Among the outer five bases of each half operator, bases 1 and 3 are palindromically related to bases 17 and 15 in Target HIV 353-369.



TCTCGAcGCAgGACTCG    Bases 681-697
tctcgAcGCAgGcgaga    Left symmetrized
cgagtAcGCAgGactcg    Right symmetrized
tatcAcCGCAAgGgata    Or3

None of bases 1-5 are palindromically related to bases 13-17 in Target HIV 681-697.



TTTGAcTAGCGgAGGCT    Bases 760-776
tttgacTAGCGgtcaaa    Left symmetrized
agcctcTAGCGgaggct    Right symmetrized
tatcAcCGCAAgGgata    Or3

None of bases 1-5 are palindromically related to bases 13-17 in Target HIV 760-776.
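The left and right symmetrized targets in the alignments above appear to be formed by keeping the central seven bases and replacing one outer arm of five bases with the reverse complement of the other. A minimal sketch of that reading follows; the 17-base string is our reading of the row labeled Bases 681-697 with case ignored, and the function names are assumptions.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def symmetrize(target, arm=5):
    """Return (left_symmetrized, right_symmetrized): the core is kept and
    one outer arm is replaced by the reverse complement of the other."""
    left, core, right = target[:arm], target[arm:-arm], target[-arm:]
    return left + core + revcomp(left), revcomp(right) + core + right


def palindromic_pairs(target):
    """1-based position pairs (i, j) whose bases are complementary across
    the center of the target."""
    n = len(target)
    return [(i + 1, n - i) for i in range(n // 2)
            if target[i] == revcomp(target[n - 1 - i])]


target_681 = "TCTCGACGCAGGACTCG"  # assumed reading of Bases 681-697
left_sym, right_sym = symmetrize(target_681)
pairs = palindromic_pairs(target_681)
```

For this target the only complementary pair lies inside the core, in agreement with the statement that none of bases 1-5 are palindromically related to bases 13-17.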

[0289] There is extensive sequence variability among the twelve phage λ operator half-sites. For example:

I 

tAtCafiCSCCfifitCaTa Consensus 

tAtcasfflssaafigcaxa 0r3a 



The bases in lower case in Consensus and Or3 sequences shown above are more variable among various lambdoid operators than are bases shown by upper case letters. Studies of mutant operators indicate that A2 and C4 are required for Cro binding. In Target HIV 353-369, bases T3, C6, C7, G8, C9, G14, and A15 match the symmetric consensus sequence, but the highly conserved A2 and C4 are different from lambdoid operators and Cro will not bind to these subsequences. Mutagenesis of the DNA-contacting residues of alpha 3 is thus the first step in producing a DBP that recognizes the left symmetrized or right symmetrized target sequences.

[0290] Target HIV 353-369 is a preferred target because the core (underlined above) is highly similar to the Kim et al. consensus. Target HIV 760-776 is preferred over Target HIV 681-697 because it is highly similar to Or3.

[0291] The method of the present invention does not require any similarity between the target subsequence and the original binding site of the initial DBP. The fortuitous existence of one or more subsequences within the target genes that have similarity to the original binding site of the initial DBP reduces the number of iterative steps needed to obtain a protein having high affinity and specificity for binding to a site in the target gene.

[0292] Since the target sequence is from a pathogenic organism, we require that the chosen target subsequence be absent or rare in the genome of the host organism, e.g., the target subsequences chosen from HIV should be absent or rare in the human genome.

[0293] Candidate target binding sites are initially screened for their frequency in primate genomes by searching all DNA sequences in the GENBANK Primate directory (2,258,436 nucleotides) using the IBI/Pustell Nucleic Acid Database Search program to locate exact or close matches. A similar search is made of the E. coli sequences in the GENBANK Bacterial directory and in the sequence of the plasmid containing the idbp gene. The sequences of potential sites for which no matches are found are used to make oligonucleotide probes for Southern analysis of human genomic DNA (SOUT75). Sequences which do not specifically bind human DNA are retained as target binding sequences.
[0294] The HIV 353-369 left symmetrized and right symmetrized target subsequences are inserted upstream of the selectable genes in the plasmid pEP1009, replacing the test sequences, to produce two operative cloning vectors, pEP1011 and pEP1012, for development of RavL and RavR DBPs. The promoter-test sequence cassettes upstream of the tet and gal operon genes are excised using StuI-HindIII and ApaI-XbaI restrictions, respectively. Replacement promoter-target sequence cassettes are synthesized and inserted into the vector, replacing Or3 with the HIV 353-369 left or right symmetrized target sequence in the sequences shown in Table 100 and Table 101.

Choice of residues in Cro to vary: 

[0295] The choice of the principal and secondary sets of residues depends on the goal of the mutagenesis. In the protocol described here we vary, in separate procedures, the residues: a) involved in DNA recognition by the protein, and b) involved in dimerization of the protein. In this section we identify principal and secondary sets of residues for DNA recognition and dimerization.

Pick principal set for DNA-recognition:

[0296] The principal set of residues involved in DNA-recognition is defined as those residues which contact the operator DNA in the sequence-specific DNA-protein complex. Although no crystal structure of a λ Cro-operator DNA complex is available, a crystal structure of a complex between the structural homolog 434 repressor N-terminal domain and a consensus operator has been described (ANDE87). A crystal structure of Cro dimer has been determined (ANDE81) and modeling studies have suggested residues that can make sequence-specific or sequence-independent contacts with DNA in sequence-specific complexes (TAKE83, OHLE83, TAKE85, TAKE86). Isolation and characterization of Cro mutants have identified residues which contact DNA in protein-operator complexes (PAKU86, HOCH86a,b, EISE85).

[0297] Important contacts with DNA are made by protein residues in and around the H-T-H region and in the C-terminal region. Hochschild et al. (HOCH86a,b) have presented direct evidence that Cro alpha helix 3 residues S28, N31, and K32 make sequence-specific contacts with operator bases in the major groove. Mutagenesis experiments (EISE85, PAKU86) and modeling studies (TAKE85) have implicated these residues as well. In addition, these studies suggest that H-T-H region residues Q16, K21, Y26, Q27, H35, A36, R38, and K39 also make contacts with operator DNA. In the C-terminal region, mutagenesis experiments (PAKU86) and chemical modification studies (TAKE86) have identified K56 and K62 as making contacts to DNA. In addition, computer modeling suggests that the 5 to 6 C-terminal amino acids of λ Cro can contact the DNA along the minor groove (TAKE85). From these considerations, we select the following set of residues as a principal set for use in variegation steps intended to modify DNA recognition by Cro or mutant derivative proteins: 16, 21, 26, 27, 31, 32, 35, 36, 38, 39, 56, 62, 63, 64, 65, 66.

Pick secondary set for DNA recognition:

[0298] The residues in the secondary set contact or otherwise influence residues in the principal set. A secondary set for DNA recognition includes the buried residues of alpha helix 3: A29, I30, A33, and I34. Interactions between buried residues in alpha helix 2 and buried residues in alpha helix 3 are known to stabilize H-T-H structure, and residues in the turn between alpha helix 2 and alpha helix 3 of H-T-H proteins are conserved among these proteins (PTAS86 p102). In λ Cro these positions are T17, T19, A20, L23, G24, and V25. Changes in the dimerization region can influence binding. In λ Cro, residues thought to be involved in dimer stabilization are E54, V55, and F58 (TAKE85, PABO84). Finally, residues influencing the position of the C-terminal arm of λ Cro are P57, P59, and S60. Thus the secondary set of residues for use in variegation steps intended to modify DNA recognition by λ Cro or Rav proteins is: 17, 19, 20, 23, 24, 29, 30, 33, 34, 54, 55, 57, 58, 59 and 60.

Pick principal set for dimerization:

[0299] Different principal and secondary sets of residues must be picked for use in variegation steps intended to alter dimer interactions. In λ Cro, antiparallel interactions between E54, V55, and K56 on each monomer have been proposed to stabilize the dimer (PABO84). In addition, F58 from one monomer has been suggested to contact residues in the hydrophobic core of the second monomer. Inspection of the 3D structure of λ Cro suggests important contacts are made between F58 of one monomer and I40, A33, L23, V25, E54, and A52. In addition, residues L7, I30, and L42 of one monomer could make contact with a large side chain positioned at 58 in the other monomer. Thus, a set of principal residues includes: 7, 23, 25, 30, 33, 40, 42, 52, 54, 55, 56, and 58.

Pick secondary set for dimerization:

[0300] The secondary set of residues for variegation steps used to alter dimer interactions includes residues in or near the antiparallel beta sheet that contains the dimer forming residues. Residues in this region are E53, P57, and P59. Residues in alpha helix 1 influencing the orientation of principal set residues are K8, A11, and M12. Residues in the antiparallel beta sheet formed by the beta strands 1, 2, and 3 (see Table 1) in each monomer also influence residues in the principal set. These residues include I5, T6, K39, F41, V50, and Y51. Thus the set of secondary residues includes: 5, 6, 8, 11, 12, 41, 50, 51, 53, 57, and 59.

Pick the range of variation for alteration of DNA binding: 

[0301] For the initial variegation step to produce a modified Rav protein with altered DNA specificity, a set of 5 residues from the principal set is picked. Focused Mutagenesis is used to vary all five residues through all twenty amino acids. The residues are picked from the same interaction set so that as many as 3.2 x 10^6 different DNA binding surfaces will be produced.

[0302] A number of studies have shown that the residues in the N-terminal half of the recognition helix of an H-T-H protein strongly influence the sequence specificity and strength of protein binding to DNA (HOCH86a,b, WHAR85, PABO84). For this reason we choose residues Y26, Q27, S28, N31, and K32 from the principal set as residues to vary in the first variegation step. Using the optimized nucleotide distribution for Focused Mutagenesis described above, and assuming that S_err = 5% as defined at the start of this Example, the parental sequence is present in the variegated mixture at one part in 3.1 x 10^ and the least favored sequence, F at each residue, is present at one part in10 . Thus, this level of variegation is well within bounds for a synthesis, ligation, transformation, and selection system capable of examining 5x10^ DNA sequences.

Pick the range of variation of residues for alteration of dimerization: 

[0303] As described in the Detailed Description and in this Example, altered λ Cro proteins, RavL and RavR, that bind specifically and tightly to left and right symmetrized targets derived from HIV 353-369, are first developed through one or more variegation steps. Site-specific changes are then engineered into ravL to produce dimerization defective proteins. Structure-directed Mutagenesis is performed on ravR to produce mutations in RavR that can complement dimerization defective RavL proteins and produce obligate heterodimers that bind to HIV 353-369.
[0304] One of the interactions in the dimerization region of λ Cro is the hydrophobic contact between residues V55 of both monomers. The VF55 mutation substitutes a bulky hydrophobic side group in place of the smaller hydrophobic residue; other substitutions at residue 55 can be made and tested for their ability to dimerize. A small hydrophobic or neutral residue present at residue 55 in a protein encoded on expression by a second gene may result in obligate complementation of VF55. In addition, changes in nearby components of the beta strand, E53, E54, K56, and P57, may affect complementation. Thus a set of residues for the initial variegation step to alter the RavR dimer recognition is 53, 54, 55, 56, and 57.

[0305] Another interaction in the dimerization region of λ Cro is the hydrophobic contact between F58 of one monomer with the hydrophobic core of the other monomer. As mentioned above, residues L7, L23, V25, A33, I40, L42, A52, and E54 of one monomer all could make contacts with a large residue at position 58 in the other monomer. The FW58 mutation inserts the largest aromatic amino acid at this position. Compensation for this substitution may require several changes in the hydrophobic core of the complementing monomer. Residues for Focused Mutagenesis in the initial variegation step to alter RavR dimer recognition in this case are: 23, 25, 33, 40, and 42.

[0306] In each of the two cases described above, the initial variegation step involves Focused Mutagenesis to alter 5 residues through all twenty amino acids. As was shown in Section 6.2.5, this level of variegation is within the limits set by using optimized codon distributions and the values for S_err and transformation yield assumed at the start of this Example.

Mutagenesis of DNA:

[0307] Codons encoding λ Cro residues Y26, Q27, S28, N31, and K32 are contained in a 51 bp PpuMI to BglII fragment of the rav gene. To produce the cassette containing the variegated codons we synthesize the 66 nucleotide antisense variegated strand, olig#50, and the primer, olig#52:



  d   l   g   v   X   X   X   a   i   X
 22  23  24  25  26  27  28  29  30  31
GAC CTA GGG GTG fzk fzk fzk GCG ATT fzk

  X   a   i   h   a   g   r   k   i
 32  33  34  35  36  37  38  39  40
fzk GCC ATC CAT GCC GGC CGA AAG ATC Tt 3'   olig#50

                 3'-ccg gct ttc tag aacgccgtg-5'   olig#52

The position of the amino acid residue in λ Cro is shown above the codon for the residue. Unaltered residues are indicated by their lower case single letter amino acid codes shown above the position number. Variegated residues are denoted with an upper case, bold X. The restriction sites for PpuM I and Bgl II are indicated below the sequence. Since restriction enzymes do not cut well at the ends of DNA fragments, 5 extra nucleotides have been added to the 5' end of the cassette. These extra nucleotides are shown in lower case letters and are removed prior to ligating the cassette into the operative vector. The sequence "fzk" denotes the variegated codons and indicates that nucleotide mixtures optimized for codon positions 1, 2, or 3 are to be used. "f" is a mixture of 26% T, 18% C, 26% A, and 30% G, producing four possibilities; "z" is a mixture of 22% T, 16% C, 40% A, and 22% G, producing four possibilities; "k" is an equimolar mixture of T and G, producing two possibilities. Each "fzk" codon produces 4 x 4 x 2 = 2^5 = 32 possible DNA sequences coding on expression for 20 possible amino acids and stop. The DNA segment above comprises (2^5)^5 = 2^25 = approx. 3.4 x 10^7 different DNA sequences coding on expression for 20^5 = 3.2 x 10^6 different protein sequences.
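The counting argument for the "fzk" codon can be checked with a short sketch. The Python below is an illustration only, not part of the described procedure: it assumes the standard genetic code, takes the f, z, and k mixtures as percentages summing to 100 (26% T for f), and the names CODON_TABLE and fzk_distribution are hypothetical helpers introduced here.

```python
from itertools import product

# Standard genetic code with bases ordered T, C, A, G at each codon position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for (i, a), (j, b), (k, c) in product(enumerate(BASES), repeat=3)
}

# Nucleotide mixtures for codon positions 1, 2 and 3 (fractions of each base).
F = {"T": 0.26, "C": 0.18, "A": 0.26, "G": 0.30}
Z = {"T": 0.22, "C": 0.16, "A": 0.40, "G": 0.22}
K = {"T": 0.50, "G": 0.50}


def fzk_distribution():
    """Return the probability of each amino acid (or '*' for stop) per fzk codon."""
    dist = {}
    for (b1, p1), (b2, p2), (b3, p3) in product(F.items(), Z.items(), K.items()):
        aa = CODON_TABLE[b1 + b2 + b3]
        dist[aa] = dist.get(aa, 0.0) + p1 * p2 * p3
    return dist


dist = fzk_distribution()
codons_per_site = len(F) * len(Z) * len(K)   # 4 x 4 x 2
varied_sites = 5                             # residues 26, 27, 28, 31, 32

print("DNA sequences per codon:", codons_per_site)
print("DNA sequences in cassette:", codons_per_site ** varied_sites)
print("protein sequences:", 20 ** varied_sites)
for aa, p in sorted(dist.items(), key=lambda item: -item[1]):
    print(aa, round(p, 4))
```

Under these assumptions the only stop codon reachable is TAG, since the third position is restricted to T or G.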
[0308] After synthesis and purification of the variegated DNA, the oligonucleotides #50 and #52 are annealed and the resulting superoverhang is filled in using Klenow fragment as described by Hill (AUSU87, Unit 8.2). The double-stranded oligonucleotide is digested with the enzymes PpuM I and Bgl II and the mutagenic cassette is purified as described by Hill. The mutagenic cassette is cloned into the vectors pEP1011 and pEP1012, which have been digested with PpuM I and BamH I, and the ligation mixtures containing variegated DNA are used to transform competent delta4 cells. The transformed cells are selected for vector uptake and for successful repression at low stringency as described above. Cells containing Rav proteins that bind to the left or right symmetrized targets display the Tc^S, Fus^R, and Gal^R phenotypes.

[0309] Surviving colonies are screened for correct DBP+ and DBP- phenotypes in the presence or absence of IPTG as described above. Relative measures of the strengths of DBP-DNA interactions in vivo are obtained by comparing phenotypes exhibited at reduced levels of IPTG. DBP genes from clones exhibiting the desirable phenotypes are sequenced. Plasmid numbers from pEP1100 to pEP1199 are reserved for plasmids yielding ravL genes encoding proteins that bind to the Left Symmetrized Targets carried on the plasmids. Similarly, plasmid numbers pEP1200 through pEP1299 are reserved for plasmids containing ravR genes encoding proteins that bind to the Right Symmetrized Targets carried on these plasmids.

[0310] Based on the determinations above, one or more RavL and RavR proteins are chosen for further analysis in vitro. Proteins are purified as described above. Purified DBPs are quantitated and characterized by absorption spectroscopy and polyacrylamide gel electrophoresis.

[0311] In vitro measurements of protein-DNA binding using purified DBPs are performed as described in the Overview: DNA-Binding, Protein Purification, and Characterization and in this Example. These measurements determine equilibrium binding constants (K_D), and the dissociation (k_d) and association (k_a) rate constants for sequence-specific and sequence-independent DBP-DNA complexes. In addition, DNase protection assays are used to demonstrate specific DBP binding to the Target sequences.

[0312] Estimates of relative DBP stability are obtained from measurements of the thermal denaturation properties of the proteins. In vitro measures of protein thermal stability are obtained from determinations of protein circular dichroism and resistance to proteolysis by thermolysin at various temperatures (HECH84) or by differential scanning calorimetry (HECH85b).

[0313] One or more iterations of variegation of the ravL and ravR genes, involving residues thought capable of influencing DNA binding, produce RavL and RavR proteins that bind tightly and specifically to the HIV 353-369 left and right symmetrized targets. Additional variegation steps to optimize protein binding properties can be performed as outlined in the Overview: Variegation Strategy.

[0314] By hypothesis, we isolate pEP1127, which contains a pdbp gene that codes on expression for RavL-27, shown in Table 114, that binds the left-symmetrized target best among selected RavL proteins. Similarly, pEP1238 contains a pdbp gene that codes on expression for RavR-38, shown in Table 115, that binds the right-symmetrized target best among selected RavR proteins.

[0315] We now use the genes for the RavR and RavL monomers as starting points for production of obligately heterodimeric proteins RavL:RavR that recognize the HIV 353-369 target. First we change the target sequences in pEP1238 (containing ravR-38). We replace both occurrences of the Right Symmetrized Target (in the tet and galT,K promoters) with the HIV 353-369 target sequence. Delta4 cells containing plasmids carrying the HIV 353-369 targets display the Ap^R, Tc^R, Fus^S, and Gal^S phenotypes. Plasmids carrying HIV 353-369 targets and the ravR gene are designated by numbers pEP1400 through pEP1499, corresponding to the number of the donor plasmid of the 1200 series; for example, replacing the target sequences in pEP1238 produces pEP1438.

Engineering dimerization mutants of RavL:

[0316] To create the site-specific VF55 and FW58 mutations in ravL we synthesize the two mutagenesis primers:

    a   e   e   F   k   p   f
   52  53  54  55  56  57  58
5' GGC GAA GAG TTC AAG CCC TTC 3'   VF55 primer
               ---

    v   k   p   W   p   s   n
   55  56  57  58  59  60  61
5' GTA AAG CCC TGG CCC AGT AAC 3'   FW58 primer
               ---



Underlining indicates the varied codons and residues. The plasmid pEP1127 (containing ravL-27) is chosen for mutagenesis. The gene fragment coding on expression for the carboxy-terminal region of the RavL protein is transferred into M13mp18 as a BamHI to KpnI fragment. Oligonucleotide-directed mutagenesis is performed as described by Kunkel (AUSU87, Unit 8.1). The fragment bearing the modified region of RavL is removed from M13 RF DNA as the BamHI to KpnI fragment and ligated into the correct location in the pEP1100 vector. Mutant-bearing plasmids are used to transform competent cells. Transformed cells are selected for plasmid uptake and screened for DBP- phenotypes (Tc^R, Fus^S, and Gal^S in E. coli delta4; Gal+ in E. coli HB101). Plasmids isolated from DBP- cells are screened by restriction analysis for the presence of the ravL gene and the site-specific mutation is confirmed by sequencing. The plasmid containing the ravL-27 gene with the VF55 mutation is designated pEP1301. Plasmid pEP1302 contains the ravL-27 gene with the FW58 alteration.

[0317] For the production of obligate heterodimers as described below, the ravL- genes encoding the VF55 or FW58 mutations are excised from pEP1301 or pEP1302 and are transferred into plasmids containing the gene for Km and neomycin resistance (neo, also known as nptII). These constructions are performed in three steps as outlined below. First, the neo gene from Tn5, coding for Km^R and contained on a 1.3 kbp HindIII to SmaI DNA fragment, is ligated into the plasmid pSP64 (Promega, Madison, WI), which has been digested with both HindIII and SmaI. The resulting 4.3 kbp plasmid, pEP1303, confers both Ap and Km resistance on host cells. Next, the bla gene is removed from pEP1303 by digesting the plasmid with AatII and BglI. The 3.5 kbp fragment resulting from this digest is purified, the 3' overhanging ends are blunted using T4 DNA polymerase (AUSU87, Unit 3.5), and the fragment is recircularized. This plasmid is designated pEP1304 and transforms cells to Km resistance. In the final step, the ravL- gene is incorporated into pEP1304. Plasmid pEP1301 or pEP1302 is digested with SfiI and the resulting 3' overhangs are blunted using T4 DNA polymerase. Next the linearized plasmid is digested with Sfigl and the resulting 5' overhangs are blunted using the Klenow enzyme reaction (KLEN70). The ca. 340 bp blunt-ended DNA fragment containing the entire ravL- gene is purified and ligated into the PvuII site in pEP1304. Transformed cells are selected for Km^R and screened by restriction digest analysis for the presence of ravL- genes. The presence of ravL- genes containing the site-specific VF55 or FW58 mutations is confirmed by sequencing. The plasmid containing the ravL- gene with the VF55 mutation is designated pEP1305. The plasmid containing the ravL- gene with the FW58 mutation is designated pEP1306.

[0318] In a manner similar to the constructions described above, we ligate the original unmodified gene into pEP1304 to produce plasmid pEP1307.

Engineering heterodimer binding of target DNA:

[0319] This round of variegation is performed to produce mutations in RavR proteins that complement the dimerization-deficient mutations in the RavL proteins produced above. To complement the FW58 mutation, the set of five residues L23, V25, A33, I40, and L42 is chosen from the primary set of residues as targets for Focused Mutagenesis.

[0320] In an initial series of procedures to test for recognition of HIV 353-369 by the heterodimer RavL:RavR, we transform cells containing pEP1438 (containing ravR-38 and HIV 353-369 targets) with pEP1307 (containing ravL). Intracellular expression of ravL and ravR produces a population of dimeric repressors: RavL:RavL, RavL:RavR, and RavR:RavR. If the heterodimeric protein is formed and binds to HIV 353-369, cells expressing both rav alleles will exhibit the Km^R, Ap^R, Gal^R, Fus^R phenotypes (vide infra). Several pairs of ravL and ravR genes are used in parallel procedures; the best pair is picked for use and further study. Selections for binding the HIV 353-369 target by the heterodimeric protein can be optimized using this system.

[0321] Focused Mutagenesis of residues 23, 25, 33, 40, and 42 requires the synthesis and annealing of two overlapping variegated strands because in the rav gene a single cassette spanning these residues extends from the BalI site to the BamHI site and exceeds the assumed synthesis limit of 100 nucleotides. As no variegation affects the overlap, the annealing region is complementary. The antisense strand of the DNA sequence from the BalI site blunt end to the end of the codon for G37 is denoted olig#53.

        q   t   k   t   a   k   d   X   g   X   y   q
       16  17  18  19  20  21  22  23  24  25  26  27
5'   C CAA ACC AAG ACA GCG AAG GAG fzk GGG fzk TAT GAG
     | BalI |

        s   a   i   n   k   X   i   h   a   g
       28  29  30  31  32  33  34  35  36  37
       AGC GCG ATT AAC AAG fzk ATC CAT GCC GGC 3'   olig#53

f = (26% T, 18% C, 26% A, 30% G)
z = (22% T, 16% C, 40% A, 22% G)
k = equimolar T and G

Olig#53 contains vg codons for residues 23, 25, and 33.

[0322] Olig#54 is the sense strand from base 1 in codon 34 to the BamHI site:

     i   h   a   g   r   k   X
    34  35  36  37  38  39  40
3'  TAG GTA CGG CCG GCA TTC jqm

     f   X   t   i   n   a   d   g   s
    41  42  43  44  45  46  47  48  49
    AAG jqm TGG TAA TTG CGA CTA CCT AGG cca ca 5'   olig#54

j = (26% A, 18% G, 26% T, 30% C)
q = (22% A, 16% G, 40% T, 22% C)
m = equimolar A and C

Olig#54 contains variegated codons for residues 40 and 42. Since olig#54 is the sense strand, the variegated nucleotide distributions must complement the distributions for codon positions 1, 2, and 3 used in the antisense strand. These sense codon distributions are designated "j", "q", and "m", and represent the complements to the optimized codon distributions developed for codon positions 1, 2, and 3, respectively, in the antisense strand. The two strands (olig#53 and olig#54) share a 12 nucleotide overlap extending from the first position in the codon for I34 to the end of the codon for G37. The overlap region is 66% G or C.
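The relationship between the two sets of mixtures, and the G+C content of the overlap, can be illustrated with a minimal sketch. This is an assumption-labeled example, not the authors' procedure: it takes f as 26% T, 18% C, 26% A, 30% G, reads the overlap as the codons ATC CAT GCC GGC printed for residues 34 through 37, and uses hypothetical helper names complement_mixture and gc_fraction.

```python
# Watson-Crick complements used to convert a mixture on one strand
# into the mixture required on the opposite strand.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement_mixture(mix):
    """Map each base of a percentage mixture to its complement."""
    return {COMPLEMENT[base]: pct for base, pct in mix.items()}


def gc_fraction(seq):
    """Fraction of G or C bases in a sequence."""
    seq = seq.upper()
    return sum(seq.count(base) for base in "GC") / len(seq)


# Mixtures for codon positions 1, 2, 3 on the strand carrying "fzk" (percent).
f = {"T": 26, "C": 18, "A": 26, "G": 30}
z = {"T": 22, "C": 16, "A": 40, "G": 22}
k = {"T": 50, "G": 50}

# Complementary mixtures for the opposite strand.
j, q, m = (complement_mixture(mix) for mix in (f, z, k))

# Assumed reading of the shared region: codons for I34, H35, A36, G37.
overlap = "ATCCATGCCGGC"

print("j =", j)
print("q =", q)
print("m =", m)
print("overlap length:", len(overlap), "GC fraction:", round(gc_fraction(overlap), 3))
```

With these inputs the complement of f carries 26% T, and 8 of the 12 overlap bases are G or C, in line with the figure of about 66% quoted above.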

[0323] The two strands shown above are synthesized, purified, annealed, and extended to form dsDNA. Following restriction endonuclease digestion and purification, the mutagenic cassettes are ligated into pEP1438 (containing the asymmetric HIV 353-369 target) in the appropriate locus in the ravR gene. The ligation mixtures are used to transform competent cells that contain pEP1306 (the plasmid with the ravL- gene carrying the FW58 site-specific mutation).

[0324] Above we picked a set of five residues in λ Cro, E53, E54, V55, K56, and P57, as targets for focused mutagenesis in the first variegation step of the procedure to produce a RavR protein that complements the dimerization-deficient VF55 RavL mutation. These five residues are contained on a 71 bp BamHI to KpnI fragment of the rav gene (Table 100). To produce a cassette containing the variegated codons we synthesize olig#58:

           g   s   v   y   a   X   X   X   X   X   f
          48  49  50  51  52  53  54  55  56  57  58
5' ct gat GGA TCC GTC TAC CCG fzk fzk fzk fzk fzk TTC

           p   s   n   k   k
          59  60  61  62  63
          CCG AGT AAC AAA AAA

           t   t   a   .
          64  65  66  67
          ACA ACA GCG TAA TAGTAGGTACC ta 3'   olig#58

[0325] After synthesis and purification of the vgDNA, strands are self-annealed using the 10 nucleotide palindrome at the 3' end of the sequence. The resulting superoverhangs are filled in using the Klenow enzyme reaction as described previously and the double-stranded oligonucleotide is digested with BamHI and KpnI. Purified mutagenic cassettes are ligated into one or more operative vectors (picked from the pEP1200 series) in the appropriate locus in the ravR gene. The ligation mixtures are used to transform competent cells that contain pEP1305 (the plasmid carrying the ravL- gene with the VF55 mutation).
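Self-annealing works because a palindromic 3' end is identical to its own reverse complement, so two copies of the same strand can pair there and prime each other. The sketch below illustrates that check in Python. The ten-base string is an assumed reading of the end of olig#58 (the printed copy is partly garbled at the last bases), and reverse_complement and is_self_complementary are hypothetical helper names.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence, preserving case."""
    table = str.maketrans("ACGTacgt", "TGCAtgca")
    return seq.translate(table)[::-1]


def is_self_complementary(seq):
    """True when a sequence equals its own reverse complement (a DNA palindrome)."""
    return seq.upper() == reverse_complement(seq).upper()


# Assumed reading of the last 10 nucleotides of olig#58; lower case marks
# the extra bases added beyond the restriction site.
three_prime_end = "TAGGTACCta"

print("palindromic 3' end:", is_self_complementary(three_prime_end))
print("contains KpnI site GGTACC:", "GGTACC" in three_prime_end.upper())
```

A design check of this kind is useful before synthesis, since a single changed base at the 3' end would prevent the strand from priming on a second copy of itself.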

[0326] Operative vectors carrying the VF55 or FW58 mutation in ravL confer Km resistance. Operative vectors carrying mutagenized ravR genes contain the gene for Ap^R as well as the selective gene systems for the DBP+ phenotypes. Cells containing complementing mutant proteins are selected by requiring both Ap^R and Km^R and repression of the complete HIV 353-369 target sequence (substituted for the Left and Right Symmetrized Targets in the selection genes). Cells possessing the desired phenotype are Ap^R, Km^R, Fus^R, and Gal^R (in E. coli delta4).

[0327] Plasmids from candidate colonies are first isolated genetically by transformation of cells at low plasmid concentration. Cells carrying plasmids coding for RavL proteins will be Km^R, while cells carrying plasmids coding for RavR proteins will be Ap^R. Plasmids are individually screened to ensure that they confer the DBP- phenotype and are characterized by restriction digest analysis to confirm the presence of ravL- or ravR genes. Plasmid pairs are co-tested for complementation by restoration of the DBP+ phenotype when both ravL- and ravR are present intracellularly. Successfully complementing plasmids are sequenced through the rav genes to identify the mutations and to suggest potential locations for optional subsequent rounds of variegation.

[0328] Plasmids containing genes for altered RavR proteins that successfully complement the ravL VF55 mutation are designated by plasmid numbers pEP1500 to pEP1599. Similarly, plasmids containing genes for altered RavR proteins that successfully complement the ravL FW58 mutation are designated by plasmid numbers pEP1600 to pEP1699.

[0329] Heterodimeric proteins are purified and their DNA-binding and thermal stability properties are characterized as described above. Pairwise variation of the RavR and RavL monomers can produce dimeric proteins having different dimerization or dimer-DNA interaction energies. In addition, further rounds of variegation of either or both monomers may be performed to optimize DNA binding by the heterodimer, dimerization strength, or both.

[0330] In this manner a heterodimeric protein that recognizes any predetermined target DNA sequence is constructed. The foregoing is hypothetical. The sequences shown as the result of selection are given by way of example and must not be construed as predictions that proteins of the stated sequence will have specific affinity for any DNA sequence.

Example 2 

[0331] Presented below is a hypothetical example of a protocol for developing new DNA-binding polypeptides, derived from the first ten residues of phage P22 Arc and a segment of variegated polypeptide, with affinity for DNA subsequences found in HIV-1, using E. coli K12 as the host cell line. Some further optimization, in accordance with the teachings herein, may be necessary to obtain the desired results. Possible modifications in the preferred method are discussed immediately following the hypothetical example.

[0332] We set the same hypothetical technical capabilities as used in Detailed Example 1.
Overview: 

[0333] To obtain significant binding between a genetically encoded polypeptide and a predetermined DNA subsequence, the surfaces must be complementary over a large area, 1000 Å^2 to 3000 Å^2. For the binding to be sequence-specific, the contacts must be spread over many (12 to 20) bases. An extended polypeptide chain that touches 15 base pairs comprises at least 25 amino acids. Some of these residues will have their side groups directed away from the DNA so that many different amino acids will be allowed at such residues, while other residues will be involved in direct DNA contacts and will be strongly constrained. Unless we have 3D structural data on the binding of an initial polypeptide to a test DNA subsequence, we cannot a priori predict which residues will have their side groups directed toward the DNA and which will have their side groups directed outward. We also cannot predict which amino acids should be used to specifically bind particular base pairs. Current technology allows production of 10^ to 10^ independent transformants per µg of DNA, which allows variation of 5 or 6 residues through all twenty amino acids. Alternatively, between 23 and 30 two-way variations of DNA bases can be applied that will affect between 8 and 30 codons.
[0334] Sauer and colleagues (VERS87b) have shown that P22 Arc binds to DNA using a motif other than H-T-H. There is as yet no published X-ray structure of Arc, though the protein has been crystallized and diffraction data have been collected (JORD85). A combination of genetics and biochemistry indicates that the first 10 residues of each Arc monomer (M-K-G-M-S-K-M-P-Q-F) bind to palindromically related sets of bases on either side of the center of symmetry of the 21 bp operator shown in Table 200. Furthermore, the first ten residues of each Arc monomer assume an extended conformation (VERS87b). The hydrophobic residues may be involved in contacts to the rest of the protein, but there are several examples from H-T-H DBPs of hydrophobic side groups being in direct contact with bases in the major groove. We do know that these first ten residues of Arc can exist in a conformation that makes sequence-specific favorable contacts with the arc operator.

[0335] We pick a target DNA subsequence from the HIV-1 genome such that a portion of the chosen sequence is similar to one half-site of the arc operator. We use part of this chosen sequence for an initial chimeric target. One half of the first target is the DNA subsequence obtained from HIV-1 and the other half of the target is one half-site of the arc operator. For this example, we will use a plasmid bearing wild-type arc operators repressed by the Arc repressor as a control. After demonstrating that Arc repressor can regulate the selectable genes, we replace the wild-type arc operator with the target DNA subsequence; we then replace the arc gene with a variegated pdbp gene and select for cells expressing DBPs that can repress the selectable genes.

[0336] Once a protein is obtained that binds to the target that has similarity to one half of the arc operator, we can change the target so that it has less similarity to one half of the arc operator and mutagenize those residues that correspond to residues 1-10 of Arc. In vivo selection will isolate a protein that binds to the new target. A few repetitions of this process can produce a polypeptide that binds to any predetermined DNA sequence.

[0337] Our potential DNA-binding polypeptide (DBP) will be 36 residues long and will contain the first ten residues of Arc, which are thought to bind to part of the half operator. DNA encoding the first ten amino acids of Arc is linked at the 3' terminus of this gene fragment to vgDNA that encodes a further 26 amino acids. Twenty-four of the codons encode two alternative amino acids so that 2^24 = approx. 1.7 x 10^7 protein sequences result. The amino acids encoded are chosen to enhance the probability that the resulting polypeptide will adopt an extended structure and that it can make appropriate contacts with DNA. The Chou-Fasman (CHOU78a, CHOU78b) probabilities are used to pick amino acids with high probability of forming beta structures (M, V, I, C, F, Y, Q, W, R, T); the amino acids are grouped into five classes in Table 16. In addition, to discourage sequence-independent DNA binding, some acidic residues should be included. Glutamic acid is a strong alpha helix former, so in early stages we use D exclusively. Further, S and T both can make hydrogen bonds with their hydroxyl groups, but T favors extended structures while S favors helices; hence we use only T in the initial phase. Likewise, N and Q provide similar functionalities on their side groups, but Q favors beta and so is used exclusively in initial phases. Positive charge is provided by K and R, but only R is used in the variegated portion. Alanine favors helices and is excluded. P kinks the chain and is allowed only near the carboxy terminus in initial iterations.

[0338] After one selection, we design a different set of binary variegations that includes the selected sequence and perform a second mutagenesis and selection. After two or more rounds of diffuse variegation and selection, we choose a subset of residues and vary them through a larger set of amino acids. We continue until we obtain sufficient affinity and specificity for the target. None of the polypeptides discussed in this example is likely to have a defined 3D structure of its own, because they are all too short. Even if one folded into a definite structure, that structure is unlikely to be related to DNA-binding. A 3D structure, obtained by X-ray diffraction or NMR, of a DNA-polypeptide complex would give us useful indications of which residues to vary. Scattering the variegation along the chain and sampling different charges, sizes, and hydrophobicities produces a series of proteins, isolated by in vivo selection, with progressively higher affinity for the target DNA sequence.

Construction of the test plasmid: 

[0339] Selection systems are the same as used in Example 1, viz. fusaric acid to select against cells expressing the tet gene and galactose killing by galT,K in a galE deleted host. First, in three genetic engineering steps, we replace: a) the rav gene in pEP1009 with the arc gene, and b) the target DNA sequences (both occurrences) with the arc operator. The resulting plasmid is our wild type control.

[0340] To replace rav with arc, the synthetic arc gene, shown in Table 201 and Table 202, is synthesized and ligated into pEP1009 that has been digested with ME" and KpnI. Cells are transformed and colonies are screened for Tc^R. The plasmid is named pEP2000. Delta4 cells transformed with pEP2000 are Tc^R and Gal^S because pEP2000 lacks the rav gene.

[0341] To insert the arc operator into the neo promoter (Pneo) for the tet gene in pEP2000, we digest pEP2000 with StuI and HindIII and ligate the purified backbone to annealed synthetic olig#430 and olig#432.
Arc operator and Pneo that promotes tet

          5'  CCT|GCG|AAC|CGG|AAT|TGC|CAG|-
olig#430  3'  gga cgc ttg gcc tta acg gtc-
              |StuI|              | -35 |

              CTG|GGG|CGC|CCT|CTG|GTA|AGG|TTG|-
              gac ccc gcg gga gac cat tcc aac-
                                  | -10 |

              GGA|ATG|ATA|GAA|GCA|CTC|TAC|TAT|A         3'  olig#432
              cct tac tat ctt cgt gag atg ata t tcg a   5'
                  |        Arc operator       | |HindIII|



The plasmid is named pEP2001 and confers Fus^R, Gal^S, Ap^R on delta4 cells.

[0342] To insert the arc operator into the amp promoter for the galT,K genes in pEP2001, we digest pEP2001 with ApaI and XbaI and ligate the purified backbone to synthetic olig#416 and olig#417 that have been annealed in the standard way.



Arc operator and Pamp that promotes galT,K

          5'        CTT|CTA|AAT|ACA|TTC|AAA|-
olig#417  3' c cgg gaa gat tta tgt aag ttt-

             TAT|GTA|TCC|GCT|CAT|GAG|ACA|ATA|ACC|-
             ata cat agg cga gta ctc tgt tat tgg-

             CTT|ATG|ATA|GAA|GCA|CTC|TAC|TAT| CGT        3'  olig#416
             gaa tac tat ctt cgt gag atg ata gca gat c   5'
                 |        Arc operator       |    |XbaI|



The plasmid is named pEP2002 and confers Gal^R, Fus^R, Ap^R on delta4 cells. This plasmid is our wild type for work with polypeptides that are selected for binding to target DNA subsequences that are related to the arc operator.



Development of polypeptides that bind chimeric target DNA:

[0343] We now replace:

a) the two occurrences of the arc operator with the first target sequence that is a hybrid of the arc operator and a subsequence picked from HIV-1, and

b) the arc gene by a variegated pdbp gene.

[0344] A hybrid non-palindromic target sequence is used in this example because selection of a polypeptide using a palindromic or nearly palindromic target DNA subsequence is likely to isolate a novel dimeric DBP. The goal of this procedure is to isolate a polypeptide that binds DNA but that does not directly exploit the dyad symmetry of DNA. The binding is most likely in the major groove, but the present invention is not limited to polypeptides that bind in the major groove. The selections are performed using a non-symmetric target to avoid isolation of novel dimers that support two symmetrically related copies of the original recognition elements.

[0345] The non-variable regions of the HIV-1 genome, as listed in Example 1, were searched using a half operator from the arc operator as search sequence.

[0346] We sought subsequences in the non-variable sequences of the HIV-1 genome that match either half of the consensus P22 arc operator shown in Table 200. Subsequences that are closer to the start of transcription are preferred as targets because proteins binding to these subsequences will have greater effect on the transcription of the genes. No sequence was found that matched all six unambiguous bases; the subsequences at 1024, 1040, and 2387 (shown in Table 203) each have a single mismatch. Lower case letters in the "arcO =" sequence indicate ambiguity in the P22 arc operator sequence. Lower case, bold, underscored letters in the HIV-1 subsequences indicate mismatch with the consensus arc operator. Two other subsequences, shown in Table 203, have one mismatch at one of the conserved bases and one mismatch with one of the ambiguous bases. The HIV-1 subsequence that starts at base 1024 is chosen as a target sequence. We replace the 3' ten bases of the arc operator with the 3' ten bases of this subsequence to produce the hybrid target sequence:

ATGATAGAAG|C|GCAACCCTCT. We insert this sequence into the promoter that regulates tet in pEP2002 by ligating dsDNA composed of an equimolar mixture of olig#440 and olig#442 into the StuI/HindIII site of pEP2002. Substitution of the arc operator by the arc-HIV-1 hybrid sequence relieves the repression by Arc. The construction is called pEP2003 and confers Tc^R, Ap^R, Gal^R on delta4 cells.
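The search for half-operator matches that tolerates a limited number of mismatches can be sketched as a sliding-window comparison. The Python below is illustrative only: the pattern and the "genome" string are invented toy data (the real consensus is in Table 200 and is not reproduced here), ambiguous operator positions are written as N, and find_sites is a hypothetical helper name.

```python
# IUPAC symbols used in the toy pattern; N stands for an ambiguous position.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "N": "ACGT"}


def find_sites(genome, pattern, max_mismatches=1):
    """Return (1-based start, window, mismatch count) for every window of the
    genome that differs from the pattern at no more than max_mismatches of the
    unambiguous positions."""
    genome = genome.upper()
    hits = []
    for start in range(len(genome) - len(pattern) + 1):
        window = genome[start:start + len(pattern)]
        mismatches = sum(
            1 for base, symbol in zip(window, pattern) if base not in IUPAC[symbol]
        )
        if mismatches <= max_mismatches:
            hits.append((start + 1, window, mismatches))
    return hits


# Invented toy data: six unambiguous positions, four ambiguous ones.
toy_pattern = "ATNNTAGNAN"
toy_genome = "CCATGGTAGCATTTATCCCAGGACGG"

hits = find_sites(toy_genome, toy_pattern, max_mismatches=1)
for position, window, mismatches in hits:
    print(position, window, mismatches)
```

Candidates found this way would still be ranked by hand, for example by distance from the start of transcription, as the text describes.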

First Target and Pneo that promotes tet

          5'  CCT|GCG|AAC|CGG|AAT|TGC|CAG|-
olig#440  3'  gga cgc ttg gcc tta acg gtc-
                                  | -35 |

              CTG|GGG|CGC|CCT|CTG|GTA|AGG|TTG|-
              gac ccc gcg gga gac cat tcc aac-
                                  | -10 |

                  ATA ATA GAG TAg caa ccc tct           - HIV 1024-1044
              GGA|ATG|ATA|GAA|GCg|caa|ccc|tct|A         3'  olig#442
              cct tac tat ctt cgC GTT GGG AGA t tcg a   5'
                  |        First Target       | |HindIII|



[0347] The second instance of the target is engineered in like manner, using pEP2003 first digested with ApaI and XbaI and then ligated to annealed olig#444 and olig#446. The plasmid is called pEP2004 and confers Gal+, Tc^R, Ap^R on HB101 cells. The plasmid pEP2004 contains the first target sequence in both selectable genes and is ready for introduction of a variegated pdbp gene.
First Target and Pamp that promotes galT,K

          5'        CTT|CTA|AAT|ACA|TTC|AAA|
olig#444  3' c cgg gaa gat tta tgt|aag ttt|
             |ApaI|               | -35 |

             TAT|GTA|TCC|GCT|CAT|GAG|ACA|ATA|ACC|CT-
             ata cat agg cga gta ctc tgt tat tgg ga
                                     | -10 |

             T|ATG|ATA|GAA|GCg|caa|ccc|tct| CGT        3'  olig#446
             a tac tat ctt cgC GTT GGG AGA gca gat c   5'
               |        First Target       |    |XbaI|



[0348] The variegated DNA for a 36 amino acid polypeptide is shown in Table 204. This DNA encodes the first ten amino acids of P22 Arc followed by 26 amino acids chosen to be likely to form extended structures. In Table 204, we indicate variegation at one base by using a letter other than A, C, G, or T to represent a specific mixture of deoxynucleotide substrates. The range of amino acids encoded is written above the codon number:

  I|M
   11
 |ATs|

indicates that the first base is synthesized with A, the second base with T, and the third base with a mixture of C and G, and that the resulting DNA could encode amino acids I or M. That the parental protein has isoleucine at residue 11 is indicated by writing I first. Residues 22 and 23 are not variegated, to provide a homologous overlap region so that olig#420 and olig#421 can be annealed. After olig#420 and olig#421 are annealed and extended with Klenow fragment and all four deoxynucleotide triphosphates, the DNA is digested with both BstEII and Bsu36I and ligated into pEP2004 that has also been digested with BstEII and Bsu36I. The ligated DNA, denoted vg1-pEP2004, is used to transform Delta4 cells. After an appropriate grow out in the presence of IPTG, the cells are selected with fusaric acid and galactose.

[0349] By hypothesis, we recover ten colonies that are Gal^R and Fus^R. We sequence the plasmid DNA from each of these colonies. A hypothetical DBP amino acid sequence from one of these colonies is shown in Table 205.

[0350] Comparison of the amino acid sequences of different isolates may provide useful information on which residues play crucial roles in DNA binding. Should a residue contain the same amino acid in most or all isolates, we might infer that the selected amino acid is preferred for binding to the target sequence. Because we do not know that all of the isolates bind in the same manner, this inference must be considered as tentative. Residues closer to the unvaried section that have repetitive isolates containing the same amino acid are more informative than residues farther away.
[0351] In a second round of Diffuse Mutagenesis, we vary the codons shown in Table 206. Residues 1 through 10
are not varied because these provide the best match for the first ten bases of the target. Residues 19, 20, and 21 are
not varied so that the synthetic oligonucleotides can be annealed. The two-way variations at residues 11 through 18 and
23 through 36 all allow the selected amino acid to be present, but also allow an as-yet-untested amino acid to appear.
It is desirable to introduce as much variegation as the genetic engineering and selection methods can tolerate without
risk that the parental DBP sequence will fall below detectable level. Having picked three residues for the homologous
overlap, we have only 23 amino acids to vary. Thus residue 22 is varied through four possibilities instead of only two.
Residue 22 was chosen for four-way variegation because it is next to the unvaried residues. We use pEP2004 as the
backbone, and ligate DNA prepared with Klenow fragment from oligonucleotides #423 and #424 (Table 206) to the
BstE II and Bsu36 I sites. The resulting population of plasmids containing the variegated DNA is denoted vg2-pEP2004.
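The size of the second-round population follows directly from the variegation plan just described. The sketch below counts protein sequences under the assumption that each variegated residue contributes exactly the stated number of alternatives (two at residues 11-18 and 23-36, four at residue 22); `library_size` is an illustrative helper, not a name from the source.

```python
def library_size(plan):
    """Multiply out the number of alternatives over all variegated residues.

    plan: list of (residue_numbers, alternatives_per_residue) pairs.
    """
    size = 1
    for residues, alternatives in plan:
        size *= alternatives ** len(residues)
    return size

second_round = [
    (list(range(11, 19)) + list(range(23, 37)), 2),  # 22 residues, two-way
    ([22], 4),                                       # one residue, four-way
]

varied = sum(len(residues) for residues, _ in second_round)
print(varied, library_size(second_round))
```

This gives 23 varied residues and 2^24 sequences, so the parental sequence makes up roughly one part in 1.7 x 10^7 of an evenly represented population.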
[0352] Table 207 shows the amino acid sequence obtained from a hypothetical isolate bearing a DBP gene
specifying a polypeptide with improved affinity for the target. Changes in amino acid sequence are observed at ten positions.
Comparisons of the sequences from several such isolates, as well as those obtained in the first round of mutagenesis,
can be used to locate residues providing significant DNA-binding energy.

[0353] Having established some affinity for the target, we now seek to optimize binding via a more focused
mutagenesis procedure. Table 208 shows a third variegation in which twelve residues in the variable region are varied
through four amino acids in such a way that the previously selected amino acids may occur. Again, pEP2004 is used as
backbone and synthetic DNA having cohesive ends is prepared from olig#325 and olig#327. The plasmid is denoted
vg3-pEP2004. In subsequent variegation, we would vary other residues through four amino acids at one time. By
hypothesis, we select the polypeptide shown in Table 209 that has high specific affinity for the first target; now we can:

a) replace both occurrences of the first target by a second target, i.e. the intact HIV-1 subsequence (1024-1044),
and

b) use the selected polypeptide as the parental DBP to generate a variegated population of polypeptides from
which we select one or more that bind to the second target.

Because the second target differs from the first in the region thought to be bound by residues 1 through 10 of the
parental DBP, we concentrate our variegation within these residues for the first several rounds of variegation and selection.
[0354] We replace the target DNA sequence in the Pneo promoter for tet in pEP2002 with ds DNA comprising
synthetic olig#450 and olig#452 at the StuI/HindIII site. The new plasmid is named pEP2010 and confers Tc^R on delta4
cells.

Second Target and Pneo that promotes tet

               5' |CCT|GCG|AAC|CGG|AAT|TGC|CAG|-
    Olig#450 = 3'  gga cgc ttg gcc tta acg gtc-
                  | StuI |            | -35 |

                  |CTG|GGG|CGC|CCT|CTG|GTA|AGG|TTG|GG-
                   gac ccc gcg gga gac cat tcc aac cc-

                    ATA ATA CAG TAg caa ccc tct        = HIV 1024-1044
                  A|ATa|ATA|cAg|tag|caa|ccc|tct|A      3'   Olig#452
                  t taT tat GtC ATC GTT GGG AGA t tcg a 5'
                   |       Second Target      | |HindIII|
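Because the two strands of the synthetic duplex must be complementary over the target, the figure can be sanity-checked in a few lines. The sketch below assumes the upper strand reads ATAATACAGTAGCAACCCTCT (HIV 1024-1044 as read from the figure, ignoring the upper/lower-case annotation) and that the lower strand of olig#452 is written 3' to 5'; the helper name `complement` is illustrative.

```python
# Translation table for base-by-base complementation (case preserved).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def complement(seq):
    """Return the base-by-base complement, written in the same left-to-right order."""
    return seq.translate(COMPLEMENT)

hiv_1024_1044 = "ATAATACAGTAGCAACCCTCT"   # upper strand, 5'->3', as read from the figure
lower_strand = "TATTATGTCATCGTTGGGAGA"     # lower strand under the target, 3'->5'

print(len(hiv_1024_1044), complement(hiv_1024_1044) == lower_strand)
```

The check confirms a 21-base target whose lower strand pairs with the upper strand at every position.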



[0355] We replace the target in the Pamp promoter for galT,K of pEP2010 with synthetic olig#454 and olig#456
between ApaI and XbaI sites. The new plasmid is named pEP2011 and confers Gal+ on HB101. pEP2011 contains the
second target in both selectable genes and is ready for introduction of a variegated pdbp gene and selection of cells
expressing polypeptides that can selectively bind the target DNA subsequence.

Second Target and Pamp that promotes galT,K

               5'          |CTT|CTA|AAT|ACA|TTC|AAA|
    Olig#454   3'    c cgg  gaa gat tta tgt|aag ttt|
                     |ApaI|                | -35   |

                  |TAT|GTA|TCC|GCT|CAT|GAG|ACA|ATA|ACC|CT
                   ata cat agg cga gta ctc tgt tat tgg ga
                                                | -10 |

                    ATA ATA CAG TAg caa ccc tct           = HIV 1024-1044
                  T|ATa|ATA|cAg|tag|caa|ccc|tct| CGT       3'   Olig#456
                  a taT tat GtC ATc GTT GGG AGA  gca gat c 5'
                   |       Second Target      |   | XbaI |



[0356] Variegation of the first eleven residues of the potential DNA-binding polypeptide is illustrated in Table 210.
Double-stranded DNA having appropriate cohesive ends is prepared from olig#460 and olig#461, Klenow fragment,
BstE II, and Bsu36 I. This DNA is ligated into similarly digested backbone DNA from pEP2011; the resulting plasmid is
denoted vg1-pEP2011. Delta4 cells are transformed and selected with fusaric acid and galactose. Table 211 shows the
sequence of a 37 amino-acid polypeptide isolated from cells exhibiting the DBP+ phenotypes by the above hypothetical
selection. The sequence shown in Table 211 is hypothetical and is given by way of example. This example must not be
construed as a prediction that this sequence has specific affinity for the target or any other DNA sequence. Further
variegation (vg2, vg3, ...) of this peptide and selection for binding to Target#2 will be needed to obtain a peptide of high
specificity and affinity for Target#2.

[0357] We anticipate that successful DBP production will take more than three or four cycles of variegation and
selection, perhaps 10 or 15. We anticipate that initial phases will require careful adjustment of the selective agents and
IPTG because the level of repression afforded by the best polypeptide may be quite low. As stated, we expect that
biophysical methods, such as X-ray diffraction or NMR, applied to complexes of DNA and polypeptide will yield important
indications of how to hasten the forced evolution.

[0358] The length of the polypeptide in the example may not be optimal; longer or shorter polypeptides may be
needed. It may be necessary to bias the amino acid composition more toward basic amino acids in initial phases to
obtain some non-specific DNA binding. Inclusion of numerous aromatic amino acids (W, F, Y, H) may be helpful or
necessary.

[0359] Other strategies to obtain polypeptides that bind sequence-specifically are illustrated in Examples 3, 4, and 5.

Example 3 

[0360] We present a second example of our selection method applied to the generation of asymmetric DBPs.
A possible problem with making and using DNA-binding polypeptides is that the polypeptides may be
degraded in the cell before they can bind to DNA. That polypeptides can bind to DNA is evident from the information on
sequence-specific binding of oligopeptides such as Hoechst 33258. Polypeptides composed of the 20 common natural
amino acids contain all the needed groups to bind DNA sequence-specifically. These are obtained by an efficient
method to sort out the sequences that bind to the chosen target from the ones that do not. To overcome the tendency
of the cells to degrade polypeptides, we will attach a domain of protein to the variegated polypeptide as a custodian.
The first example of a custodial domain presented is residues 20-83 of barley chymotrypsin inhibitor.
[0361] The strategy is to fuse a polypeptide sequence to a stable protein, assuming that the polypeptide will fold up
on the stable domain and be relatively more protected from proteases than the free polypeptide would be. If the domain
is stable enough, then the polypeptide tail will form a make-shift structure on the surface of the stable domain, but when
the DNA is present, the polypeptide tail will quickly (within a few milliseconds) abandon its former protector and bind the DNA.
The barley chymotrypsin inhibitor (BCI-2) is chosen because it is a very stable domain that does not depend on disulfide
bonds for stability. We could attach the variegated tail at either end of BCI-2. A preferred order of amino acid residues
in the chimeric polypeptide is: a) methionine to initiate translation, b) BCI-2 residues 20-83, c) a two residue linker, d)
the first ten residues of Arc, and e) twenty-four residues that are varied over two amino acids at each residue. The linker
consists of G-K. Glycine is chosen to impart flexibility. Lysine is included to provide the potentially important free amino
group formerly available at the amino terminus of the Arc protein. The first target is the same as the first target of
Example 2.
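The preferred order of segments given above fixes the length of the chimeric polypeptide and the position of the variegated tail. The sketch below tabulates this, assuming the custodial domain spans BCI-2 residues 20-83 as stated when the domain is first introduced; the segment labels and the `layout` helper are illustrative only.

```python
# (segment label, length in residues), in the preferred N-to-C order.
segments = [
    ("initiator Met", 1),
    ("BCI-2 residues 20-83", 83 - 20 + 1),
    ("G-K linker", 2),
    ("Arc residues 1-10", 10),
    ("variegated tail", 24),
]

def layout(segments):
    """Return (label, first, last) positions within the chimeric chain, 1-based."""
    placed, start = [], 1
    for label, length in segments:
        placed.append((label, start, start + length - 1))
        start += length
    return placed

for label, first, last in layout(segments):
    print(f"{label:22s} {first:3d}-{last:3d}")

# Two alternatives at each of the 24 tail residues.
print("tail variants:", 2 ** 24)
```

Under these assumptions the chain is 101 residues long, with the variegated tail occupying the last 24 positions.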

[0362] Table 300 shows the sequence of a gene encoding the required sequence. The ambiguity of the genetic
code has been resolved to create restriction sites for enzymes that do not cut pEP1009 outside the rav gene. This gene
could be synthesized in several ways, including the method illustrated in Table 301 involving ligation of oligonucleotides
470-479. Plasmid pEP3000 is derived from pEP2004 by replacement of the arc gene with the sequence shown in Table
300 by any appropriate method.

[0363] Table 302 illustrates variegated olig#480 and olig#481 that are annealed and introduced into the
CI2-arc(1-10) gene between PpuM I and Kpn I to produce the plasmid population vg1-pEP3000. Cells transformed with
vg1-pEP3000 are selected with fusaric acid and galactose in the presence of IPTG. Further variegation (vg2, vg3, ...) will
be required to obtain a polypeptide sequence having acceptably high specificity and affinity for Target#1.

Example 4 

[0364] We present a second strategy involving a polypeptide chain attached to a custodial domain. In this strategy, 
the custodial domain contains a DNA-recognizing element that will be exploited to obtain quicker convergence of the 
forced evolution. 

[0365] The three alpha helices of Cro fold on each other. It has not been observed that these helices fold by
themselves, but no efforts in this direction have been reported. We will attach a variegated segment of 24 residues to residue
35 of Cro (H35 is the last residue of alpha 3). The target will be picked to contain a good approximation to the half OR3
site at one end, but no constraint is placed on the bases corresponding to the dyad-related other half of OR3. A
sequence that departs widely from the OR3 sequence is actually preferred, because this discourages selection of a
novel dimeric molecule. We assume that alpha-3 forms and binds to the same four or five bases that it binds in OR3 and
that a polypeptide segment attached to the carboxy terminus of alpha-3 can continue along the major groove. Attaching
24 amino acids of variegated polypeptide immediately after the last residue of alpha-3, wherein the polypeptide is chosen: a) to
have more positive charge than negative charge, b) to have beta chain predominate, c) to have some aromatic groups,
and d) to have some H-bonding groups, produces a population that is then cloned; host cells are selected for
expression of a polypeptide that binds preferentially to the target sequence.

[0366] We first construct a hybrid target sequence (Target #3) containing one OR3 half-site fused to a portion of the
final target. This hybrid target DNA subsequence is inserted into the selectable genes in the same manner as the arc
operator was inserted in Example 2. We then follow the same procedure to vary the 24 residues; first we vary
twenty-four residues, using two possible amino acids at each residue. We carry out two or more cycles of such diffuse
variegation. Then we vary 12 residues, using 4 possible amino acids at each residue. We do two or more iterations of this
process so that all residues are varied at least once.

[0367] We have now generated one or more DBPs that bind well to one half of the final target sequence. Next we
generate binding to the other half of the final target. First we replace both instances of Target #3 with the final target
sequence, Target #4. We then vary the alpha helix 3 and the surface of the hypothesized domain formed by helices 1-3
to optimize binding to the final target sequence.

[0368] A search of the non-variable regions of the HIV-1 genome reveals that bases 624-640
(aATCtCTAGCAGTGGCG) contain a good match to one half of OR3, as shown in Table 400. As first target of this
example, we choose TATCCCTAGCAGTGGCG, denoted Target#3, that has one half of OR3 and nine bases from HIV-1.
Once a sequence is obtained that binds Target#3, we replace Target#3 by Target#4 = HIV 624-640 and variegate the
recognition helices taken from Cro.

[0369] To engineer Target#3 into Pneo that regulates tet, plasmid pEP2002 is digested with StuI and HindIII and the
purified backbone is ligated to an annealed, equimolar mixture of olig#490 and olig#492. Delta4 cells are transformed
and selected with Tc; replacement of the arc operator relieves the repression by Arc. Plasmid DNA from Tc^R colonies
is sequenced to confirm the construction; the construction is called pEP4000.
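The relation between Target#3 and the natural HIV-1 subsequence can be made explicit by listing the positions at which they differ. The sketch below assumes HIV 624-640 reads AATCTCTAGCAGTGGCG (case annotation ignored) and uses HIV-1 numbering for the reported positions; `mismatches` is an illustrative helper name.

```python
def mismatches(reference, candidate, first_position):
    """List (position, reference base, candidate base) wherever two equal-length sequences differ."""
    if len(reference) != len(candidate):
        raise ValueError("sequences must have equal length")
    return [
        (first_position + i, r, c)
        for i, (r, c) in enumerate(zip(reference.upper(), candidate.upper()))
        if r != c
    ]

hiv_624_640 = "AATCTCTAGCAGTGGCG"   # as read from the text and figure
target3 = "TATCCCTAGCAGTGGCG"       # Target#3

print(mismatches(hiv_624_640, target3, 624))
```

Under this reading the two 17-base sequences differ at only two positions, both within the half that was adjusted toward OR3, which is why replacing Target#3 by Target#4 is a small step.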






Target #3 and Pneo that promotes tet

               5' |CCT|GCG|AAC|CGG|AAT|TGC|CAG|-
    Olig#490 = 3'  gga cgc ttg gcc tta acg gtc-
                  | StuI |            | -35 |

                  |CTG|GGG|CGC|CCT|CTG|GTA|AGG|TTG|GG-
                   gac ccc gcg gga gac cat tcc aac cc-

                    aAT CtC TAG CAG TGG CG        = HIV 624-640
                  A|TAT|CCC|TAG|CAG|TGG|CGA       3'   Olig#492
                  t ata ggg atc gtc acc gct tcg a 5'
                   |     Target #3      | |HindIII|



[0370] We engineer the second instance of the target, in like manner, into Pamp for galT,K, using ApaI and XbaI to
digest pEP4000 and olig#494 and olig#496. HB101 cells (galK-) are transformed and are selected for ability to grow on
galactose as sole carbon source. Plasmid DNA from Gal+ colonies is sequenced in the region of the insert to confirm
the construction. The plasmid is called pEP4001.

Target #3 and Pamp that promotes galT,K

               5'          |CTT|CTA|AAT|ACA|TTC|AAA|
    Olig#494   3'    c cgg  gaa gat tta tgt|aag ttt|
                                           | -35   |

                  |TAT|GTA|TCC|GCT|CAT|GAG|ACA|ATA|ACC|
                   ata cat agg cga gta ctc tgt tat tgg
                                                | -10 |

                  |CTT|TAT|CCC|TAG|CAG|TGG|CG CGT       3'   Olig#496
                   gaa ata ggg atc gtc acc gc gca gat c 5'
                      |     Target #3      |



[0371] A gene fragment encoding the first two helices of Cro is shown in Table 401. Olig#483 and olig#484 are
synthesized and extended in the standard manner and the DNA is digested with BstEII and KpnI. This DNA is ligated to
backbone from pEP4001 that has been digested with BstE II and KpnI; the resulting plasmid, denoted pEP4002,
contains the Target#3 subsequence in both selectable genes and is ready for introduction of a variegated pdbp gene
between BglII and KpnI. Table 402 shows a piece of vgDNA prepared to be inserted into the BglII-KpnI sites of
pEP4002. Table 403 shows the result of a selection of delta4 cells, transformed with vg1-pEP4002, with fusaric acid and
galactose in the presence of IPTG. Additional cycles of variegation of residues 36-61 are carried out in such a way that
the amino acid selected at the previous cycle is included. After several cycles in which 22-24 residues are varied
through two possible amino acids, we choose 10-13 amino acids and vary them through four possibilities.
[0372] Once reasonably strong binding to Target#3 is obtained, we replace Target#3 with Target#4 and vary the
residues in helix 3 (residues 26-35) and, to a lesser extent, helix 2 (residues 16-23).



Example 5

[0373] We disclose here a method of engineering a polypeptide extension onto the amino terminus of P22 Arc, a
natural DBP, so that the novel DBP develops asymmetric DNA-binding specificity for a subsequence found in the
HIV-1 genome. Others have observed that loss of arms from natural DBPs may cause loss of binding specificity and affinity
(PABO82a and ELIA85), but none, to our knowledge, have suggested adding arms to natural DBPs in order to enhance
or alter specificity or affinity. The new construction is denoted a "polypeptide extension DBP"; the gene is denoted ped
and the proteins are denoted Ped. Wild-type Arc forms dimers and binds to a partially palindromic operator. We will
generate a sequence of DBPs descendent from Arc. Early members of this family will form dimers, but will have sufficient
binding area such that asymmetric targets will be bound. In final stages of the development, proteins that do not
dimerize will be engineered.

[0374] Table 200 shows the symmetric consensus of left and right halves of the P22 arc operator, arcO. Table 500a
shows a schematic representation of the model for binding of Arc to arcO that is supported by genetic and biochemical
data (VERS87b). Arc is thought to bind B-DNA in such a way that residues 1-10 are extended and the amino terminus
of each monomer contacts the outer bases of the 21 bp operator (RT Sauer, public talk at MIT 15 September 1987).
[0375] Arc is preferred because: a) one end of the polypeptide chain is thought to contact the DNA at the exterior
edge of the operator, and b) Arc is quite small so that genetic engineering is facilitated. P22 Mnt is also a good
candidate for this strategy because it is thought that the amino terminal six residues contact the mnt operator, mntO, in
substantially the same manner as Arc contacts arcO. Mnt has significant (40%) sequence similarity to Arc (VERS87a). Mnt
forms tetramers in solution and it is thought that the tetramers bind DNA while other forms do not. When the mnt gene
is progressively deleted from the 3' end to encode truncated proteins, it is observed that proteins lacking K79 and
subsequent residues have lowered affinity for mntO and that proteins lacking Y78 and subsequent residues cannot form
tetramers and do not bind DNA sequence-specifically (KNIG88). Some truncated Mnt proteins of 77 or fewer residues
form dimers, but these dimers do not present the DNA-recognizing elements in such a way that DNA can be bound. Arc
is preferred over Mnt because Arc is smaller and because Arc acts as a dimer.

[0376] Other natural DBPs that have DNA-recognizing segments thought to interact with DNA in an extended
conformation (referred to as arms or tails) and thought to contact the central part of the operator, such as λ Cro or λ cI
repressor, are less useful. For these proteins to be lengthened enough to contact DNA outside the original operator,
several residues would be needed to span the space between the central bases contacted by the existing terminal
residues and the exterior edge of the operator.

[0377] Table 500a illustrates interaction of Arc dimers with arcO; the two "C"s of Arc represent the place, near
residue F10, at which the polypeptide chain ceases to make direct contact with the DNA and folds back on itself to form a
globular domain, as shown in Table 500b and Table 500c. Which of these alternative possibilities actually occurs has
not been reported. Our strategy is compatible, with some alterations, with either structure. In Table 500b, each set of
residues 1-10 makes contact with a domain composed of residues 11-57 of the same polypeptide chain; the dimer
contacts are near the carboxy terminus. Table 500c shows an alternative interaction in which residues 1-10 of one
polypeptide chain interact with residues 11-57 of the other polypeptide chain; the dimer contacts occur shortly after residue
10. The similarity of sequences of Arc and Mnt, the demonstration of function of DNA-recognizing segments transferred
from Arc to Mnt (RT Sauer, public talk at MIT 15 September 1987 and Knight and Sauer cited in VERS86b), and the
behavior of Mnt on truncation suggest that Table 500b is the correct general structure for Arc, but the structure
diagrammed in Table 500c is also possible.

[0378] Table 501 shows the four sites at which one of the consensus arc half operators comes within one base of
matching ten bases (six unambiguous and four having two-fold ambiguity) in the non-variable segments of HIV-1 DNA
sequence, as listed in Example 1. The symbol "@" marks base pairs that vary among different strains of HIV-1.
Because we intend to extend Arc from its amino terminus, we seek subsequences of HIV-1 that: a) match one of the
arc half operators, and b) have non-variable sequences located so that an amino-terminal extension of the Arc protein
will interact with non-variable DNA. The subsequences 1024-1033 and 4676-4685 meet this requirement while the
subsequences at 1040-1049 and 2387-2396 do not. In the case of 1040-1049, the amino-terminal extension would proceed
in the 3' direction of the strand shown and would reach variable DNA after two base pairs. For 2387-2396, variable
sequence is reached at once. The subsequence 1024-1033 is preferred over the subsequence 4676-4685 because it
is much closer to the beginning of transcription of HIV so that binding of a protein at this site will have a much greater
effect on transcription. In the remainder of this example, positions within the target DNA sequence will be given the
number of the corresponding base in HIV-1. Base A1034 of HIV-1 is aligned with the central base of arcO.
[0379] HIV 1024-1044 has only three bases in each half that are palindromically related to bases in the other half
by rotation about base pair 1034: A1024/T1044, A1026/T1042, and G1032/C1036. The latter two base pairs correspond to
positions in arcO that are not palindromically related. Five of the six palindromically related bases of arcO correspond
to non-palindromically related bases in HIV 1024-1044. Thus no dimeric protein derived from Arc is likely to bind HIV
1016-1046 if symmetric changes are made only in the residues 1-10 (or in any other set of residues originally found in
Arc). Our strategy is to add, in stages, eleven variegated residues at the amino terminus and to select for specific
binding to a progression of targets, the final target of the progression being bases 1016-1037 of HIV-1. Because the region
of protein-DNA interaction is increased beyond that inferred for wild-type Arc-arcO complexes, unfavorable contacts in
bases aligned with the right half of arcO can be compensated by favorable contacts of the polypeptide extension with
bases 1016-1023. The penultimate selection isolates a dimeric protein that binds to the HIV-1 target 1016-1037; the
ultimate selection isolates a protein that does not dimerize and binds to the same target.
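The count of palindromically related bases in HIV 1024-1044 can be reproduced by pairing each position with its mirror image about base 1034 and testing for Watson-Crick complementarity. The sketch below assumes the 21-base sequence ATAATACAGTAGCAACCCTCT for HIV 1024-1044, as it appears in the target figures of Example 2; `palindromic_pairs` is an illustrative helper.

```python
PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}

def palindromic_pairs(seq, first_position):
    """Return labelled base pairs (e.g. 'A1024', 'T1044') related by rotation about the central base."""
    n = len(seq)
    pairs = []
    for i in range(n // 2):
        j = n - 1 - i                       # mirror position about the center
        if PAIR[seq[i]] == seq[j]:          # palindromic if mirror base is the complement
            pairs.append((f"{seq[i]}{first_position + i}",
                          f"{seq[j]}{first_position + j}"))
    return pairs

hiv_1024_1044 = "ATAATACAGTAGCAACCCTCT"     # assumed reading of the target sequence
print(hiv_1024_1044[len(hiv_1024_1044) // 2], palindromic_pairs(hiv_1024_1044, 1024))
```

With this sequence the central base is A (position 1034) and exactly three symmetric pairs are found, matching the pairs named in the text.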

[0380] Table 502 shows a progression of target sequences that leads from wild-type arcO to HIV 1016-1037. It is
emphasized that finding a subsequence of HIV-1 that has high similarity to one half of arcO is not necessary; rather,
use of this similarity reduces the number of steps needed to change a sequence that is highly similar to arcO into one
that is highly similar or identical to an HIV-1 subsequence. Reducing the number of steps is useful because, for each
change in target, we must: a) construct plasmids bearing selectable genes that include the target sequence in the
promoter region, b) construct a variegated population of ped genes, and c) select cells transformed with plasmids carrying
the variegated population of ped genes for DBP+ phenotype.

[0381] In sections (a), (c), (e), and (g) of Table 502, bases in the targets are in upper case if they match HIV
1016-1046 and are underscored if they match the wild-type arcO sequence.

[0382] We construct a series of plasmids, each plasmid containing one of the target sequences in the promoter
region of each of the selectable genes. For each target, we variegate the ped gene and select cells for phenotypes
dependent on functional DBPs. For each target, several rounds of variegation and selection may be required. We
anticipate that a plurality of proteins will be obtained from independent isolates by selection for binding to one target. We pick
the protein that shows the strongest in vitro binding to short DNA segments containing the target as the parental Ped
to the next round of variegation and selection. Genetic methods, such as generation of point mutations in the ped gene
or in the target and selection for function or non-function of Ped, can be used to determine associations between
particular bases and particular residues (VERS86b).

[0383] Once a Ped with specific binding for the target is obtained, it may be useful to determine a 3D structure of
the Ped-DNA complex by X-ray diffraction or other suitable means. Such a structure would provide great help in
choosing residues to vary to improve binding to a given target or to an altered target.

[0384] We initiate development of a polypeptide extension DBP having affinity for HIV 1016-1037 by generating a
variegated population of Peds and selecting for binding to the first target. Table 502a shows the first target, which we
designed to have identity to arcO in the left half, but to have a mismatch (arcO vs. target) at A1038 (which is C in the
corresponding position in the right half of arcO and is palindromically related to a G in the left half); the rationale is as
follows. Vershon et al. (VERS87b) report that chemical modification with dimethyl sulfate of the wild-type C:G at this
location interferes mildly with binding of Arc and that this location is strongly protected from modification by dimethyl
sulfate if Arc is bound to the operator. Thus we expect a mismatch between wild-type arcO and the first target at A1038 to
make wild-type Arc bind poorly. Binding can be restored, however, by favorable contacts to bases 1021-1023 by the
amino-terminal extension.

[0385] An alternative first target would have C1038, as does arcO at the corresponding location, and A1041, unlike
arcO or HIV-1. Vershon et al. (VERS87b) report that methylation of the corresponding C:G base pair strongly interferes
with binding of Arc. Thus, changing the base that corresponds to HIV 1041 should have a strong effect on binding of
Arc to the alternative target.

[0386] In the first variegation step, we extend Arc by five variegated residues at the amino terminus. Since five
residues can contact no more than three bases in a sequence-specific manner, we limit the extent of the target to those
bases that correspond to HIV 1021-1044. Inclusion of bases corresponding to HIV 1016-1020 at this initial stage might
position the target too far downstream from the promoters of the selectable genes to allow strong repression of these
promoters. Once a Ped displaying binding to bases corresponding to 1021-1044 has been isolated, we can introduce a
greater length of the HIV-1 sequence into the left side of the target without concern that the Ped will bind too far
downstream from the promoter of the selectable genes to block transcription. Furthermore, once binding by the amino
terminal extension has been established, we can, in a stepwise manner, remove the right half of arcO from the target,
thereby forcing more asymmetric binding to the left half of arcO and the bases upstream of 1024.
[0387] The first target is engineered into both selectable genes as in Example 2. We use olig#501 and olig#502,
shown in Table 503, to introduce the first target downstream of Pneo that promotes tet, replacing arcO in pEP2002; the
resulting plasmid is called pEP5000. From pEP5000, we use olig#503 and olig#504 to construct pEP5010 in which the
first target replaces arcO downstream of Pamp that promotes galT,K.

[0388] Table 502b shows schematically how the amino terminal residues align to the first target; the five residue
extension is unlikely to contact more than 3 base pairs upstream from base 1024. The alteration in the right half
operator prevents tight binding unless the additional residues make favorable interactions upstream of 1024. Care is taken
in designing the two instances of the target that the downstream boundaries are different: AAG in Pneo and CGT in
Pamp. Thus, for the novel DBP to bind specifically to both instances of the target, it must recognize the common
sequence upstream of base 1024.

[0389] An initial variegated ped is constructed using olig#605, as shown in Table 504, and comprises: a) a
methionine codon to initiate translation, b) five variegated codons that each allow all twenty possible amino acids, and c) the
Arc sequence from 101 to 157. (Because we are constructing a polypeptide extension at the amino terminus, we have
added 100 to the residue numbers within Arc so that Arc residue 1 is designated 101.) This variegated segment of DNA
comprises (2^5)^5 = 2^25 = approx. 3.4 x 10^7 different DNA sequences and encodes 20^5 = 3.2 x 10^6 different protein sequences;
with the given technical capabilities, we can detect each of the possible protein sequences. The 3' terminal 20 bases of
olig#605 are palindromically related so that each synthetic oligonucleotide primes itself for extension with Klenow
enzyme. The DNA is then digested with Bsu36 I and BstEII and is ligated to the backbone of appropriately digested
pEP5010, which bears the first target in each selectable gene. Transformed delta4 cells are selected for Fus^R Gal^R at
low, medium, and high concentrations of IPTG, the inducer of the lacUV5 promoter that regulates ped. Because the first
target is quite similar to arcO, we anticipate that a functional Ped will be isolated with low-level induction of the ped gene
with IPTG.
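The two population sizes quoted for the five-codon extension can be recomputed directly. The sketch below assumes that each variegated codon is drawn from 2^5 = 32 codons (for example a scheme that fixes the third base to one of two nucleotides); that reading of the garbled exponent is an assumption, as are the variable names.

```python
CODONS_PER_POSITION = 2 ** 5     # assumed: 32 codons per variegated position
AMINO_ACIDS_PER_POSITION = 20    # each codon allows all twenty amino acids
VARIEGATED_CODONS = 5

dna_sequences = CODONS_PER_POSITION ** VARIEGATED_CODONS
protein_sequences = AMINO_ACIDS_PER_POSITION ** VARIEGATED_CODONS

print(f"DNA sequences:     {dna_sequences} ({dna_sequences:.1e})")
print(f"protein sequences: {protein_sequences} ({protein_sequences:.1e})")
print(f"average DNA sequences per protein: {dna_sequences / protein_sequences:.1f}")
```

On average about ten DNA sequences encode each protein sequence, although the redundancy is uneven because amino acids differ in their number of codons.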

[0390] More than one round of variegation and selection may be required to obtain a Ped with sufficient affinity and
specificity for the first target. Function of a Ped is judged in comparison to the protection afforded by wild-type Arc in
cells bearing pEP2002. Specifically, strength of Ped binding is measured by the IPTG concentration at which 50% of
cells survive selection with a constant concentration of galactose or fusaric acid, chosen as a standard for this purpose.
A Ped is deemed acceptable if it can protect cells against the standard concentrations of galactose and fusaric acid,
administered in separate tests, with an IPTG concentration of 5 x 10^-4 M. Preferably, a Ped can protect cells against the
standard concentrations of galactose and fusaric acid, tested separately, with no more than ten times the concentration
of IPTG needed by pEP2002-bearing cells. Variegation of residues 101, 102, and others may be needed. We anticipate
that a plurality of independent functional Peds will be isolated; we discriminate among these by measuring in vitro
binding to DNA oligonucleotides that contain the target sequence. The amino-acid sequences of different isolates are
compared: residues that always contain only one or a few kinds of amino acids are likely to be involved in sequence-specific
DNA binding. Table 505 shows a hypothetical isolate, Ped-6, that binds the first target.
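The 50%-survival criterion described above can be evaluated from a dose-response series by interpolation. The sketch below is one possible way to do so, assuming survival rises with IPTG concentration and interpolating linearly on a logarithmic concentration scale; the survival values are invented placeholders, and the function names are hypothetical.

```python
import math

def iptg_for_half_survival(concentrations, survival):
    """Interpolate (log scale) the IPTG concentration at which survival crosses 50%.

    Returns None if the series never brackets 50% survival.
    """
    points = sorted(zip(concentrations, survival))
    for (c0, s0), (c1, s1) in zip(points, points[1:]):
        if s0 <= 0.5 <= s1 and s1 > s0:
            fraction = (0.5 - s0) / (s1 - s0)
            log_c = math.log10(c0) + fraction * (math.log10(c1) - math.log10(c0))
            return 10 ** log_c
    return None

def is_preferred(ped_iptg, reference_iptg, max_ratio=10):
    """Preferred: no more than ten times the IPTG needed by the reference (pEP2002) cells."""
    return ped_iptg <= max_ratio * reference_iptg

# Hypothetical dose-response data (molar IPTG, fraction of cells surviving).
iptg = [1e-6, 1e-5, 1e-4, 1e-3]
ped_survival = [0.0, 0.25, 0.75, 1.0]

half = iptg_for_half_survival(iptg, ped_survival)
print(half, is_preferred(half, reference_iptg=5e-6))
```

Any monotone fitting method could be substituted for the interpolation; the point is only that a single concentration summarizes each Ped for comparison with wild-type Arc.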

[0391] Table 502c shows the changes between the first target and the second target. Three changes are made left
of center to make the target more like HIV 1016-1042. Only the change G1030->C affects a base that is palindromically
related in arcO. One change is made right of center that makes the target more like HIV 1016-1042, less like arcO, and
less palindromically symmetric. Furthermore, the target is shortened on the right by two bases so that selection isolates
proteins that bind asymmetrically to the left side of the target. Starting with pEP2002, we introduce, in two genetic
engineering steps that use olig#541, olig#542, olig#543, and olig#544 (Table 506), the second target (in place of arcO) into
the promoter region of each selectable gene; the resulting plasmid is denoted pEP5020.

[0392] Table 507 shows a variegated sequence that is ligated into pEP5020 between BstEII and Bsu36 I. Variegated
codons are shown in the same way as in Table 204.

[0393] Table 502d illustrates that residues 100-110 of Ped-6 contact the bases of the second target that differ from
the first target. Accordingly, residues 1 and 96-99 of Ped are not variegated in the DNA shown in Table 507; rather,
residues 100-110 are each varied through four possibilities, always including the amino acid previously present at that
residue. This generates 4^11 = approx. 4 x 10^6 different DNA and protein sequences. Selection of transformed
cells for Fus^R Gal^R and screening by in vitro DNA binding yields, by hypothesis, a plasmid coding on expression for the
protein Ped-6-2, illustrated in Table 508.

[0394] An alternative to the variegation shown in Table 507 is one in which we vary residues 101-105, 108, and 110
through eight possibilities each, yielding 2.0 x 10^6 DNA and protein sequences. These residues, except M101, are
indicated to be in contact with the operator. M101 has been altered by the attachment of the polypeptide extension and thus
should be altered. After variegation of the listed residues and selection, further variegation should include some
variegation of residues 96-103 because changes in the listed residues may change the context within which residues 96-103
contact the DNA.
[0395] More than one round of variegation and selection may be required to obtain a Ped having sufficient affinity
and specificity for the second target.
[0396] Table 502e shows the changes from the second target to the third, which comprise: a) inclusion of bases
1018-1020, b) one change to the left of the 21 bp arcO region, c) two changes at the center of the arcO region, d) two
changes left of center, and e) removal of bases 1041 and 1042. All of these changes make the third target less
symmetric and more like HIV 1016-1040. The third target is introduced into each of the selectable genes in the same
manner as the second target. The resulting plasmid, obtained in two genetic engineering steps, is denoted pEP5030. Table
502f shows that residues 96-110 are all potential sites to alter the specificity and affinity of DBPs derived from
Ped-6-2. Thus, in Table 510, we illustrate a segment of variegated DNA that comprises 2^20 = approx. 10^6 DNA sequences and
encoding on expression 10^6 protein sequences having ten residues varied through two possibilities and five residues through
four possibilities. The vgDNA is cut with BstEII and Bsu36I and ligated into pEP5030. Transformed delta4 cells
are selected for Fus^R Gal^R. By hypothesis, we isolate a plasmid, denoted pEP5031, that codes on expression for the
protein Ped-6-2-5 shown in Table 509.



[0397] Table 502g shows the changes between the third and fourth targets. The changes are: a) inclusion of bases
1016-1017, b) two changes right of center, and c) removal of bases 1038-1040. The initial variegation to be selected
using the fourth target consists of an extension of six residues at the amino terminus of Ped-6-2-5, shown in Table 511.
In iterative steps of forced evolution of proteins, one should not produce a number of different DNA sequences greater
than the number of independent transformants that one can obtain (about 10^ with current technology). Because there
are no residues corresponding to 90-95 in the parental DBP (Ped-6-2-5), the first variegation and selection with the
fourth target is a non-iterative step and it is permissible to produce 10^9 DNA sequences and 6.4 x 10^7 protein
sequences. In subsequent iterative rounds of variegation, the number of variants is, preferably, limited to a fraction, e.g.
10%, of the number of independent transformants that can be generated and subjected to selection. A protein,
illustrated in Table 512 and denoted Ped-6-2-5-2, is isolated, by hypothesis, through selection of a variegated population of
transformed cells for Fus^R Gal^R.
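
The library sizes quoted in the preceding paragraphs follow from multiplying the number of alternatives allowed at each varied position. The sketch below reproduces that arithmetic. The function names and the 32-codon assumption for the six-residue extension (any base, any base, T or G) are illustrative and are not taken from the tables, and the number of transformants passed to `max_variants` is a placeholder, not a figure from the text.

```python
from math import prod

# Number of alternatives at each varied residue for the schemes described above.
schemes = {
    "residues 100-110, four choices each": [4] * 11,
    "seven residues, eight choices each": [8] * 7,
    "ten residues x2 and five residues x4": [2] * 10 + [4] * 5,
    "six-residue extension, all twenty amino acids": [20] * 6,
}


def library_size(choices):
    """Distinct sequences = product of the alternatives at each varied position."""
    return prod(choices)


sizes = {label: library_size(choices) for label, choices in schemes.items()}

# ASSUMPTION: each of the six new codons is drawn from 32 codons (N, N, T-or-G).
dna_variants_six_codons = 32 ** 6


def max_variants(independent_transformants, fraction=0.10):
    """Upper bound on variants for an iterative round: a fraction of the transformants."""
    return int(independent_transformants * fraction)


if __name__ == "__main__":
    for label, size in sizes.items():
        print(f"{label}: {size:.3g}")
    print(f"DNA variants for six 32-codon positions: {dna_variants_six_codons:.3g}")
    # Placeholder transformant count, chosen only to show the call.
    print("variant ceiling for 1e8 transformants:", max_variants(1e8))
```

Under these assumptions the six-residue extension gives 6.4 x 10^7 protein sequences and roughly 10^9 DNA sequences, which is why that step is treated as non-iterative.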

[0398] Ped-6-2-5-2 binds specifically to HIV 1016-1037 as a dimer. HIV 1016-1037 has no palindromic symmetry.
Binding to an asymmetric DNA sequence by a dimeric protein is possible because the Ped-6-2-5-2 dimer has more
recognition elements than wild-type P22 Arc dimer and so can bind even though nearly half of the right half of arcO has
been removed from the target. Ped-6-2-5-2 is useful as is; nevertheless, obtaining a monomeric protein may have
advantages, including: a) higher affinity for the target because suboptimal interactions are eliminated, and b) lower
molecular weight. Obtaining a functional monomeric Ped is easiest if Arc dimers interact in the manner shown in Table
500b. We use the following steps to isolate a protein that binds specifically to HIV 1016-1037 as a monomer.

[0399] Ped-6-2-5-2 is the parental DBP from which we derive the monomeric DBP. The route taken from a
palindromically symmetric arcO sequence to an asymmetric HIV sequence was designed to select for binding to the left half of
the original arc operator.

[0400] Proteins that do not dimerize, but that bind specifically to the fourth target, can be generated in several ways.
Because the 3D structure of Arc is still unknown, we cannot use Structure-Directed Mutagenesis to pick residues to
vary to eliminate dimerization. One way to obtain monomeric proteins is to use diffuse mutagenesis to vary all residues
from 111 to 157 and select for proteins that can bind the target sequence. Another strategy is to synthesize the ped
gene in such a way that numerous stop codons are introduced. This causes a population of progressively truncated
proteins to be expressed. Table 513 shows a segment of variegated DNA that spans the BglII to KpnI sites of the arc gene
used throughout this example. This segment is synthesized with suitable spacer sequences on the 5' end. The extra T
at the 3' end allows two such chains to prime each other for extension with Klenow enzyme. The ratios of bases in the
variegated positions are picked so that each varied codon causes about 35% of polypeptides to terminate at that
position. Since we intend to determine how much the protein can be shortened and remain functional, we begin by replacing
codon 153 with stop. Since 15 residues are varied, only about 0.3% of chains will continue to stop codon 153 without
one or more stop codons. All the intermediate length chains will be present in the selection in detectable amount. delta4
cells transformed with pEP5030 containing this vgDNA are selected for Fus^R Gal^R. Because each variegated codon
causes translation termination in about 35% of the genes in the variegated population, shorter coding regions are more
abundant than longer ones. Thus, the shortest gene that encodes a functional repressor will be the most abundant gene
selected. Plasmid DNA from a number of independent selected colonies is sequenced. The dimerization properties of
several functional DBPs are tested in vitro and the sequence of the shortest monomeric protein is retained for use and
further study.
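
The length distribution produced by the stop-codon scheme above can be estimated with a geometric model. This is a minimal sketch, assuming that each of the 15 varied codons independently becomes a stop with the same probability; the function name and parameters are illustrative.

```python
def truncation_distribution(n_varied=15, p_stop=0.35):
    """Return (fractions, read_through).

    fractions[k] is the fraction of genes whose first stop falls at the
    (k+1)-th varied codon, counted from the 5' end; read_through is the
    fraction that reaches the fixed stop with no earlier stop codon.
    ASSUMPTION: every varied codon terminates independently with p_stop.
    """
    fractions = [(1.0 - p_stop) ** k * p_stop for k in range(n_varied)]
    read_through = (1.0 - p_stop) ** n_varied
    return fractions, read_through


fractions, read_through = truncation_distribution()

if __name__ == "__main__":
    for k, f in enumerate(fractions, start=1):
        print(f"first stop at varied codon {k:2d}: {f:.4%}")
    print(f"no stop at any varied codon: {read_through:.4%}")
```

With exactly 35% termination per codon the model gives a read-through fraction near 0.16%, the same order of magnitude as the figure of about 0.3% quoted above; a per-codon rate near 32% would give 0.3%. In either case the fractions fall monotonically with length, which is the basis for the statement that the shortest functional gene should be the most abundant one selected.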

[0401] In this manner, we generate a protein that binds monomerically to a DNA sequence that has no palindromic
symmetry.



Example 6 


[0402] We illustrate here the fusion of two known DNA-binding domains to form a novel DNA-binding protein that
recognizes an asymmetric target sequence. The progression of targets is the same as shown in Table 502 (Example
5). The amino-acid sequence of the initial DBP is illustrated in Table 600 and comprises the third zinc-finger domain
from the product of the Drosophila Kr gene (ROSE86), a short linker, and P22 Arc. The linker consists of three residues
that are picked to allow: a) some flexibility between the two domains, and b) introduction of a KpnI site. The polypeptide
linker should not allow excessive flexibility because this would reduce the specificity of the DBP.
[0403] The primary set of residues to vary to alter the DNA-binding are marked with asterisks. Those in the zinc
finger were picked by reference to the model of Gibson et al. (GIBS88); all residues having outward-directed side
groups (except those directed upward from the beta strands) were picked. Residues 101-110 (1-10 of Arc) were also
picked to be in the primary set. Other residues within the Arc sequence may be varied. For each target in the
progression, we initially choose for variegation residues in the primary set that are most likely to abut that part of the target most
recently changed. For example, for the first target, we begin by varying residues 21, 24, 25, 28, and 29 each through
all twenty amino acids. After one or more rounds of variegation and selection, other residues in the primary and
secondary sets are varied.

[0404] Other zinc-finger domains, such as those tabulated by Gibson et al. (GIBS88), are potential binding
domains. Other proteins with known DNA binding, such as 434 Cro, may be used in place of Arc. Multiple zinc fingers
could be added, stepwise, to obtain higher levels of specificity and affinity.
TABLES
Table 1, continued



40 45 50 55 60 65 
Notes:

Substitutions occurring at solvent-exposed positions in the
unbound repressor dimer are shown above the wild type
sequence.

Substitutions occurring at internal positions are shown below
the wild type sequence.

Substitutions that produce repressor dimers with normal or
nearly normal DNA binding affinities are shown in
parentheses.



Table 2

Examples of selections for plasmid uptake and maintenance in E. coli

gene                (alternate designation)    function
Amp^R               (Ap^R, bla)                beta-lactamase
Kan^R               (Km^R, neo)                aminoglycoside P-transferase
Tet^R               (Tc^R, tet)                membrane pump
Cam^R               (Cm^R, cat)                acetyltransferase
colicin immunity                               binds to colicin in vivo
TrpA+                                          complementation of trpA

Table 3

Examples of selections for plasmid uptake and
maintenance in S. cerevisiae

gene      function
Ura3+     complements ura3 auxotroph
Trp1+     complements trp1 auxotroph
Leu2+     complements leu2 auxotroph
His3+     complements his3 auxotroph
Neo^R     resistance to G418
Table 4 (continued): Agents for Selection
of DBP Binding in E. coli and Relevant Genotypes



1) Deletions are strongly preferred over point mu

2) Only secA gene need be controlled by DBP.

3) Mutations in crp are highly pleiotropic; some
effects seen in cell wall, crp best used in conn
tion with selections having intracellular action.

4) Resistance to colicins can arise in several wa 
use of two or more E-colicins discriminates again 
other mechanisms. Because colicins do not replic 
they are preferred over phage for selection. Pha 
are useful to verify selection of cells repressin 
expression of 2IQQ&- 

5) Because colicins do not replicate, they are 
preferred over phage for selection. Phage are us 
to verify selection of cells repressing expressio 
Table 5: Some Recommended Pairs of Selectabl

Binding Marker Genes 

A) Recommended pairs: 



aalT.K tetA | qfl3,TiKPh?5 

arcP oheS i tetA thVA 

l3L£Z tetA { £tL£2i 

dctA I smsh fi^ 

£££ thyft i btuB pyrr 

lamB thyA | tQHA q?^lT.K 

secA & DvrF | £ir gygK 

malE ^ lacZ fusion 

tsx cysK I ftcgp 1^92 

B) Less Preferred pairs: Reason 

tetA araP Both transport related. 

secA & lacZ Both related to lacZ 

malE - lacZ fusion function - 

pyrf "thvA Both related to thymine 

lamB aalT.K Both related to sugar me 

bolism. 

cir tsx Both related to colicin 

ptsM tetA Both transport related 

^onA ptsM Both transport related 

crp ipcZ Both related to sugar me 

bolism 
Table 6: Promoters

A: Correlation between Sequence Homology and Promoter
Strength (MULL84)

Promoter      Homology score    Log KBk2
T7 A1         74.0              7.40
T7 A2         73.4              7.20
λ PR          58.6              7.13
lac UV5       59.2              6.94
              59.2              6.30
T7 D          63.9              6.30
              63.9              6.00
Tn10 Pout     5^-2              6.71
Tn10 Pin      52.1              6.18
λ PRM         49.7              4.71
              49.7              4.17

Pamp          5^-^

Pneo          58.0
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Table 6 (continued) Promoters 



B: Sequences of some promoters 



Name 



±1 



1. / rwx 



\» xx ^A. .L^^rkwx ■ 



T7 A2 



GTA TTGACAA CATG AAGTAACATGCAGiaASAIACA AATCG 



X PR 



GTG TTGACTA TTTT ACCTCTGGCGG TTAGAATG GT TGC^ 



lac UV5 GGC TTTACA CTTTA TGCTTCCGGCTCAlAI&aiCTG TGGA 



T7 D 



G CG TTG ACTT GATG GGTCTTTATGTGiaSJSSITTA GGT£ 



TnlO Pout GGGSASiATTOGTA AAGAGAGTCGTGiaA&&IATC GAGl 
TnlO Pin AGG TGGATA CACAT . CTTGTCATATGAI£MAIGGT TTCfi 



X Prm 



TG TTAGATAT TTAT CCCTTGCGGTGAiaJiailTAA CAT& 



amp 



AC ATTCAAAT ATGT ATCCGCTCATGAfiASft&IAAC CCTg 



neo 



GA ATTGCCA GCTGG GGCGCCCTCTGG TAAGGTT GG GAAG 
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Table 7 FUNCTIONAL SUBSTITUTIONS IN HELIX 5 

REPRESSOR 

84 85 86 87 88 89 90 91 
Table 8
Some Preferred Initial DBPs
λ cI repressor
λ Cro
434 cI repressor
434 Cro
P22 Mnt
P22 Arc
P22 cII repressor
λ cII repressor
λ Xis
λ Int
cAMP Receptor Protein from E. coli
Trp Repressor from E. coli
Kr protein from Drosophila
Transcription Factor IIIA from Xenopus laevis
Lac Repressor from E. coli
Tet Repressor from Tn10
Mu repressor from phage mu
Yeast MAT-a1-alpha2
Polyoma Large T antigen
SV40 Large T antigen
Adenovirus E1A
Human Transcription Factor SP1 (a zinc finger protein)
Human Transcription Factor AP1 (product of jun)

Table 9, Table 10, and Table 11 have been deleted.
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Table 13. MISSENSE MUTATIONS IN P22 ARC REPRESSOR 

THAT PRODUCE AN ARC" PHENOTYPE 
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TABLE 14 



MISSENSE MUTATIONS AT SOLVENT EXPOSED POSITIONS 
OF THE H-T-H REGIONS OF REPRESSOR PROTEINS 

\ Repressor 



L 

Y(K)T 



(S) 
(N) 



35 



40 



45 



50 



55 



TURN 



HELIX 3 



QESVADKMGMGQSGVGALFNGINA 

* * * * * 

EYL DD DD K 

L V 
S 



* 

S P 
K 
L 



Table 14b \ erg 



F 
K 



K 
R T 



15 



20 



25 



30 



35 



TURN 



GQTKTAKDLGVYQSAINKAIHAGRK 



* 

R H 

E P 



* * * 

D H N 

N L R 

C A 



* * 

N 
T 
Q 



* * 

Q T 
L 
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Table 14C 434 Repressor 

H 
L 
V 
T 
A 

20 25 30 35 

• • • • 

HELIX 2 TUHN HEtlXJi 

QAELAQKVGTTQQSIEQLEN 
* * * * 

A 
H 
L 
S 

R 
P 
K 



70 



T^Pl^' i4d Trp Repressor 
T 

75 80 85 



TURN 



2i£LI2L^ 




s 

D 



M C 
H 
Table 14, NOTES:

Positions in wild type repressors believed to contact DNA are
indicated by a * below the wild type residue.
Substitutions that greatly decrease repressor binding to DNA
are shown below the wild type sequence.

Substitutions that produce repressors with normal or nearly
normal DNA binding affinities are shown above the wild type
sequence.

Substitutions that increase repressor affinity for DNA are
shown in parentheses above the wild type sequence.



Table 15: deleted.
Table 16: Genetic Code Table
With Secondary-Structure
Preferences

First                    Second Base                     Third
Base      T           C           A           G          Base

T         F  b/a      S  a/b      Y  b        C  b       T
          F  b/a      S  a/b      Y  b        C  b       C
          L  a/b      S  a/b      stop        stop       A
          L  a/b      S  a/b      stop        W  b/a     G

C         L  a/b      P           H  a/b      R  b/a     T
          L  a/b      P           H  a/b      R  b/a     C
          L  a/b      P           Q  b/a      R  b/a     A
          L  a/b      P           Q  b/a      R  b/a     G

A         I  b        T  b        N  a/b      S  a/b     T
          I  b        T  b        N  a/b      S  a/b     C
          I  b        T  b        K  a/b      R  b/a     A
          M  b        T  b        K  a/b      R  b/a     G

G         V  b        A  a        D  a/b      G  b/a     T
          V  b        A  a        D  a/b      G  b/a     C
          V  b        A  a        E  a        G  b/a     A
          V  b        A  a        E  a        G  b/a     G


Amino acids denoted "b" strongly favor extended structures, 
Amino acids denoted "b/a" favor extended structures. 
Amino acids denoted "a/b" strongly favor helical structures. 
Amino acids denoted "a" very strongly favor helices - 
Proline is denoted and favors neither beta sheets nor 
helices. 



b: I, M, V, T, Y, C 

b/a: F, R, G, W 

a/b: L, S, H, N, K, D 

a: A, E 

P 
Table 17

Fraction of DNA molecules having n non-parental bases when reagents
that have fraction M of parental nucleotide are used.
Number of bases using mixed reagents is 30.

M       .9965      .97716     .92612     .8577      .79433     .63096
f0      .9000      .5000      .1000      .0100      .0010      .000001
f1      .09499     .3506      .2393      .0498      .00777     .0000175
f2      .00485     .1188      .2768      .1197      .0292      .000149
f3      .00016     .0259      .2061      .1854      .0705      .000812
f4      .000004    .00409     .1110      .2077      .1232      .003207
f8      0.         2x10^-7    .00096     .0336      .1182      .080165
f16     0.         0.         0.         5x10^-7    .00006     .027281
f23     0.         0.         0.         0.         0.         .0000089
most    0          0          2          5          7          12
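
The entries of Table 17 appear consistent with a binomial model in which each of the 30 positions independently retains the parental base with probability M. The following sketch recomputes the fractions under that assumption; the function names are illustrative and are not part of the original program.

```python
from math import comb

N_BASES = 30  # number of positions synthesized with mixed reagents (Table 17)


def fraction_with_n_changes(n, m, total=N_BASES):
    """Binomial probability that exactly n of `total` bases are non-parental
    when each position keeps the parental base with probability m."""
    return comb(total, n) * m ** (total - n) * (1.0 - m) ** n


def parental_fraction_for_f0(f0, total=N_BASES):
    """Parental fraction M for which a fraction f0 of molecules is unchanged."""
    return f0 ** (1.0 / total)


if __name__ == "__main__":
    ns = (0, 1, 2, 3, 4, 8, 16, 23)
    for m in (0.9965, 0.97716, 0.92612, 0.8577, 0.79433, 0.63096):
        row = "  ".join(f"{fraction_with_n_changes(n, m):.3g}" for n in ns)
        print(f"M={m:<8} {row}")
```

The helper `parental_fraction_for_f0` shows how the M values in the table can be chosen: for example, leaving half of the molecules unchanged over 30 positions requires M of about 0.977.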


fn is the fraction of all synthetic DNA molecules having n non-parental 
bases. 

"most" is the value of n having the highest probability. 
Table 18: best vgCodon

Program "Find Optimum vgCodon."

INITIALIZE-MEMORY-OF-ABUNDANCES

DO ( t1 = 0.21 to 0.31 in steps of 0.01 )
. DO ( c1 = 0.13 to 0.23 in steps of 0.01 )
. . DO ( a1 = 0.23 to 0.33 in steps of 0.01 )
Comment calculate g1 from other concentrations
. . . g1 = 1.0 - t1 - c1 - a1
. . . IF( g1 .ge. 0.15 )
. . . . DO ( a2 = 0.37 to 0.50 in steps of 0.01 )
. . . . . DO ( c2 = 0.12 to 0.20 in steps of 0.01 )
Comment Force D+E = R + K
. . . . . . g2 = (g1*a2 - .5*a1*a2)/(c1 + 0.5*a1)
Comment Calc t2 from other concentrations.
. . . . . . t2 = 1. - a2 - c2 - g2
. . . . . . IF( g2 .gt. 0.1 .and. t2 .gt. 0.1 )
. . . . . . . CALCULATE-ABUNDANCES
. . . . . . . COMPARE-ABUNDANCES-TO-PREVIOUS-ONES
. . . . . . . end_IF_block
. . . . . . end_DO_loop ! c2
. . . . . end_DO_loop ! a2
. . . . end_IF_block ! if g1 big enough
. . . end_DO_loop ! a1
. . end_DO_loop ! c1
. end_DO_loop ! t1

WRITE the best distribution and the abundances.
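
The search outlined in Table 18 can be sketched in Python as follows. Two points are assumptions rather than statements from the listing: CALCULATE-ABUNDANCES is taken to sum codon probabilities per amino acid with the third base restricted to T or G in equal amounts, and COMPARE-ABUNDANCES-TO-PREVIOUS-ONES is taken to keep the distribution with the largest ratio of least-favored to most-favored amino acid. The base frequencies passed to `abundances` in the check at the end were back-calculated from Table 19 and are not printed in the source.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code; '*' marks a stop codon.
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def abundances(t1, c1, a1, g1, t2, c2, a2, g2, t3=0.5, g3=0.5):
    """Expected frequency of each amino acid (and '*') for one variegated codon
    whose three bases are drawn independently; third base is T or G only."""
    first = {"T": t1, "C": c1, "A": a1, "G": g1}
    second = {"T": t2, "C": c2, "A": a2, "G": g2}
    third = {"T": t3, "G": g3}
    out = {}
    for (b1, p1), (b2, p2), (b3, p3) in product(
        first.items(), second.items(), third.items()
    ):
        aa = CODE[b1 + b2 + b3]
        out[aa] = out.get(aa, 0.0) + p1 * p2 * p3
    return out


def lfaa_mfaa_ratio(ab):
    """Ratio of least- to most-abundant amino acid (stop codons excluded)."""
    values = [v for aa, v in ab.items() if aa != "*"]
    return min(values) / max(values)


def find_optimum():
    """Grid search over the ranges of Table 18 (hundredths kept as integers)."""
    best = None
    for it, ic, ia in product(range(21, 32), range(13, 24), range(23, 34)):
        ig = 100 - it - ic - ia
        if ig < 15:
            continue
        t1, c1, a1, g1 = it / 100, ic / 100, ia / 100, ig / 100
        for ia2, ic2 in product(range(37, 51), range(12, 21)):
            a2, c2 = ia2 / 100, ic2 / 100
            g2 = (g1 * a2 - 0.5 * a1 * a2) / (c1 + 0.5 * a1)  # force D+E = R+K
            t2 = 1.0 - a2 - c2 - g2
            if g2 > 0.1 and t2 > 0.1:
                ab = abundances(t1, c1, a1, g1, t2, c2, a2, g2)
                score = lfaa_mfaa_ratio(ab)
                if best is None or score > best[0]:
                    best = (score, (t1, c1, a1, g1, t2, c2, a2, g2), ab)
    return best


# Check against Table 19 using back-calculated (assumed) base frequencies.
table19 = abundances(0.26, 0.18, 0.26, 0.30, 0.22, 0.16, 0.40, 0.22)

if __name__ == "__main__":
    score, dist, ab = find_optimum()
    print("best lfaa/mfaa ratio:", round(score, 4))
    print("t1,c1,a1,g1,t2,c2,a2,g2 =", tuple(round(x, 4) for x in dist))
```

Because the selection criterion is an assumption, the distribution this search returns need not coincide with the one behind Table 19.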
Table 19: Abundances obtained
from optimum vgCodon

Amino                       Amino
acid     Abundance          acid     Abundance
A        4.80%              C        2.86%
D        6.00%              E        6.00%
F        2.86%              G        6.60%
H        3.60%              I        2.86%
K        5.20%              L        6.82%
M        2.86%              N        5.20%
P        2.88%              Q        3.60%
R        6.82%              S        7.02% mfaa
T        4.16%              V        6.60%
W        2.86% lfaa         Y        5.20%
stop     5.20%

lfaa = least-favored amino acid
mfaa = most-favored amino acid
ratio = Abun(W)/Abun(S) = 0.4074

j    (1/ratio)^j    (ratio)^j       stop-free
1    2.454          .4074           .9480
2    6.025          .1660           .8987
3    14.788         .0676           .8520
4    36.298         .0275           .8077
5    89.095         .0112           .7657
6    218.69         4.57 x 10^-3    .7258
7    536.78         1.86 x 10^-3    .6881
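
The ratio and stop-free columns of Table 19 can be regenerated from three of its abundances. This is a short sketch; the variable names are illustrative, and the stop-free column is assumed to be the probability that j variegated codons contain no stop.

```python
# Abundances read from Table 19.
ABUN_LFAA = 0.0286  # W, least-favored amino acid
ABUN_MFAA = 0.0702  # S, most-favored amino acid
ABUN_STOP = 0.0520  # stop codon

ratio = ABUN_LFAA / ABUN_MFAA

# One row per number j of variegated codons:
# (j, (1/ratio)^j, ratio^j, probability that j codons contain no stop)
rows = [
    (j, (1.0 / ratio) ** j, ratio ** j, (1.0 - ABUN_STOP) ** j)
    for j in range(1, 8)
]

if __name__ == "__main__":
    print(f"ratio = {ratio:.4f}")
    for j, inv, r, stop_free in rows:
        print(f"{j}  {inv:10.3f}  {r:.3e}  {stop_free:.4f}")
```

The (ratio)^j column measures how much rarer the least-favored sequence is than the most-favored one when j residues are varied with this codon.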



EP0 452 413 B1 



Table 20: Calculate worst codon.
Program "Find worst vgCodon within Serr of given distribution."

INITIALIZE-MEMORY-OF-ABUNDANCES

READ Serr
Comment Serr is % error level.
Comment T1i,C1i,A1i,G1i, T2i,C2i,A2i,G2i, T3i,G3i
Comment are the intended nt-distribution.

READ T1i, C1i, A1i, G1i
READ T2i, C2i, A2i, G2i
READ T3i, G3i

Fdwn = 1.-Serr
Fup = 1.+Serr

DO ( t1 = T1i*Fdwn to T1i*Fup in 7 steps)
. DO ( c1 = C1i*Fdwn to C1i*Fup in 7 steps)
. . DO ( a1 = A1i*Fdwn to A1i*Fup in 7 steps)
. . . g1 = 1. - t1 - c1 - a1
. . . IF( (g1-G1i)/G1i .lt. -Serr)
Comment g1 too far below G1i, push it back
. . . . g1 = G1i*Fdwn
. . . . factor = (1.-g1)/(t1 + c1 + a1)
. . . . t1 = t1*factor
. . . . c1 = c1*factor
. . . . a1 = a1*factor
. . . . end_IF_block
. . . IF( (g1-G1i)/G1i .gt. Serr)
Comment g1 too far above G1i, push it back
. . . . g1 = G1i*Fup
. . . . factor = (1.-g1)/(t1 + c1 + a1)
. . . . t1 = t1*factor
. . . . c1 = c1*factor
. . . . a1 = a1*factor
. . . . end_IF_block

. . . DO ( a2 = A2i*Fdwn to A2i*Fup in 7 steps)
. . . . DO ( c2 = C2i*Fdwn to C2i*Fup in 7 steps)
. . . . . DO ( g2 = G2i*Fdwn to G2i*Fup in 7 steps)
Comment Calc t2 from other concentrations.
. . . . . . t2 = 1. - a2 - c2 - g2
. . . . . . IF( (t2-T2i)/T2i .lt. -Serr)
Comment t2 too far below T2i, push it back
. . . . . . . t2 = T2i*Fdwn
. . . . . . . factor = (1.-t2)/(a2 + c2 + g2)
. . . . . . . a2 = a2*factor
. . . . . . . c2 = c2*factor
. . . . . . . g2 = g2*factor
. . . . . . . end_IF_block
. . . . . . IF( (t2-T2i)/T2i .gt. Serr)
Comment t2 too far above T2i, push it back
. . . . . . . t2 = T2i*Fup
. . . . . . . factor = (1.-t2)/(a2 + c2 + g2)
. . . . . . . a2 = a2*factor
. . . . . . . c2 = c2*factor
. . . . . . . g2 = g2*factor
. . . . . . . end_IF_block
. . . . . . IF( g2 .gt. 0.0 .and. t2 .gt. 0.0)
. . . . . . . t3 = 0.5*(1.-Serr)
. . . . . . . g3 = 1. - t3
. . . . . . . CALCULATE-ABUNDANCES
. . . . . . . COMPARE-ABUNDANCES-TO-PREVIOUS-ONES
. . . . . . . t3 = 0.5
. . . . . . . g3 = 1. - t3
. . . . . . . CALCULATE-ABUNDANCES
. . . . . . . COMPARE-ABUNDANCES-TO-PREVIOUS-ONES
. . . . . . . t3 = 0.5*(1.+Serr)
. . . . . . . g3 = 1. - t3
. . . . . . . CALCULATE-ABUNDANCES
. . . . . . . COMPARE-ABUNDANCES-TO-PREVIOUS-ONES
. . . . . . . end_IF_block
. . . . . . end_DO_loop ! g2
. . . . . end_DO_loop ! c2
. . . . end_DO_loop ! a2
. . . end_DO_loop ! a1
. . end_DO_loop ! c1
. end_DO_loop ! t1

WRITE the WORST distribution and the abundances.
Table 21: Abundances obtained
using optimum vgCodon assuming
5% errors

Amino                       Amino
acid     Abundance          acid     Abundance
A        4.59%              C        2.75%
D        5.45%              E        6.02%
F        2.49% lfaa         G        6.63%
H        3.59%              I        2.71%
K        5.73%              L        6.71%
M        3.00%              N        5.19%
P        3.02%              Q        3.97%
R        7.68% mfaa         S        7.01%
T        4.37%              V        6.00%
W        3.05%              Y        4.77%
stop     5.27%

ratio = Abun(F)/Abun(R) = 0.3248

j    (ratio)^j       stop-free
1    .3248           .9473
2    .1055           .8973
3    .03425          .8500
4    .01112          .8052
5    3.61 x 10^-3    .7627
6    1.17 x 10^-3    .7225
7    3.81 x 10^-4    .6844



Table 22, deleted 
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Tables for Example 1 

Table 100: X 0^3 Downstream 
of Paap that promotes galT.K 

01ig#4 3< ca att ac c caa aaa aat tta tcrtlaaa tttl- 

I HpaT II Anal I J z25 L 



(TAT|GTA|T CC|GCTlCAT|GAG|ACA|ATAlACCl- 

ata cat aaa caa crta etc tcrt tat tqq- 

I -19 I 



|CTT|ATC|ACC|GCAlAGG |GAT|ATC|TAGlAGTlC 3' - 01ig#3 
aaa tag t ag cat tec eta tag ate t 5' 
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Table 101: X 0^3 Downstr am 
of Pneo 'that promotes t;et 



5' i ICCTlGCGlAAClCGGlAATlTGi 



011g#6 3* ggc cac tta ogg tta acq crtc- 



|CTGlGGGlCG gt^OT)CTGlGTAlAGGlTTGl- 
qac ccc acq qaa aac cat tec aac- 

-19 I 



|GGA|TAT|CAC|CGC|AAGlGGAlTA 3» « Olig#5 

qqcy ata qtq qcq t tc cct att eg a 5* 

J iHindllll 
Table 102: rav gene
using lacUV5 as promoter

SpeI-BstEII-(BalI-PpuMI-BglII-BamHI-AvaI)
-KpnI-(trpa terminator)-SfiI

5'-ACTAGT CCAGG C TTTACA CTT TATGC TTCCG GCTCG TATAAT GTGT GG
   |SpeI|

AAT TGTGA GCGGA TAACA ATTTC ACAC ! lacUV5

A GGTAACC AGGAGGAAATAAA ! BstEII & Shine-Dalgarno seq.
  |BstEII|

 m   e   q   r   i   t   l   k   d   y   a   m   r
 1   2   3   4   5   6   7   8   9  10  11  12  13
ATG GAA CAA CGC ATA ACC CTA AAG GAC TAC GCG ATG CGC

 f   g   q   t   k   t   a   k   d   l
14  15  16  17  18  19  20  21  22  23
TTT GGC CAA ACC AAG ACA GCG AAG GAC CTA
  |BalI  |                   |PpuMI  |

 g   v   y   q   s   a   i   n   k   a   i
24  25  26  27  28  29  30  31  32  33  34
GGG GTG TAT CAG AGC GCG ATT AAC AAG GCC ATC

 h   a   g   r   k   i   f   l   t   i   n   a   d
35  36  37  38  39  40  41  42  43  44  45  46  47
CAT GCC GGC CGA AAG ATC TTC CTA ACC ATT AAC GCT GAT
                 |BglII |

Table 102, continued

 g   s   v   y   a   e   e   v   k   p   f   p   s
48  49  50  51  52  53  54  55  56  57  58  59  60
GGA TCC GTC TAC GCG GAA GAG GTA AAG CCC TTC CCG AGT

 n   k   k   t   t   a   .   .   .
61  62  63  64  65  66  67  67  68
AAC AAA AAA ACA ACA GCG TAA TAG TA GGTACC
                                   |KpnI|

agtcta agcccgc ctaatga gcgggct tttttttt ! terminator

GGCCcgactGGCC ! SfiI
|   SfiI    |
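
As a consistency check on Table 102, the coding region can be translated with the standard genetic code and compared with the amino-acid line printed above the codons. The sketch below assumes codon 51 reads TAC and codon 59 reads CCG, which is what the amino-acid line (y and p) and Table 114 indicate; the function name is illustrative.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code; '*' marks a stop codon.
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Codons of the rav gene as read from Table 102 (ASSUMPTION: TAC at 51, CCG at 59).
RAV_CODONS = """
ATG GAA CAA CGC ATA ACC CTA AAG GAC TAC GCG ATG CGC
TTT GGC CAA ACC AAG ACA GCG AAG GAC CTA
GGG GTG TAT CAG AGC GCG ATT AAC AAG GCC ATC
CAT GCC GGC CGA AAG ATC TTC CTA ACC ATT AAC GCT GAT
GGA TCC GTC TAC GCG GAA GAG GTA AAG CCC TTC CCG AGT
AAC AAA AAA ACA ACA GCG TAA TAG
""".split()


def translate(codons):
    """Translate codons to one-letter amino acids, stopping at the first stop."""
    protein = []
    for codon in codons:
        aa = CODE[codon]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


rav_protein = translate(RAV_CODONS)
rav_dna = "".join(RAV_CODONS)

# Recognition sequences of the internal sites annotated in Table 102.
SITES = {"BalI": "TGGCCA", "BglII": "AGATCT", "BamHI": "GGATCC"}
site_positions = {name: rav_dna.find(seq) for name, seq in SITES.items()}

if __name__ == "__main__":
    print(rav_protein)
    print(site_positions)
```

A mismatch between `rav_protein` and the printed amino-acid line would point to a transcription error in the codon line rather than in the design.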
Table 103

pEP1001     pAA3H with 4.3 kbp deletion of λ, Cla I
            site introduced.

pEP1002     pEP1001 with fd terminator and Spe I, Sfi
            I, Hpa I cloning site distal to galT,K.

pEP1003     pEP1002 with Px promoter replaced by pBR322
            amp promoter (Pamp) and OR3 upstream of
            galT,K; Pamp and OR3 bounded by Hpa I and
            Xba I and containing Apa I cloning site
            between Hpa I and Pamp.

pEP1004     pKK175-6 with (Pamp, galT,K, fd terminator,
            Spe I, Sfi I cloning site) from pEP1003.

pEP1005     pEP1004 with Tn5 neo promoter (Pneo) and
            OR3 bounded by Stu I and Hind III.

pEP1006     pEP1005 with BamH I site removed by site-
            specific mutation.

pEP1007     pEP1006 with (lacUV5, rav cloning
            site, trpa terminator).

pEP1008     pEP1007 with N-terminal part of rav gene.

pEP1009     pEP1008 with complete rav gene.

pEP1010     pEP1009 with OR3 replaced by scrambled
            OR3 sequence.

pEP1011     pEP1009 with OR3 sequences replaced with
            the HIV 353-369 Left Symmetrized Target.

pEP1012     pEP1009 with OR3 sequences replaced with
            the HIV 353-369 Right Symmetrized Target.
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Table lp3 /eontlnuedl : 



pEPllOO to 
pEP1199 

pEPlZOO to 
PEP1299 

PEP1301 



pEPlOll with tflXL* 



p£P1012 with rav^ . 



pEPllOO with ravL" VF55. 



pEP1302 



pEPllOO with XSa^iT FW58. 



PEP1303 



pSP64 with Tn5 nfto 



PEP1304 



p£P1303 with deletion of Ap resistance 
gene. 



PEP1305 



PEP1304 with rav L" VF55 



PEP1306 



pEP1304 with tftYL^ FWSa 



pEP1307 



pEP1304 with rav 



PEP1400 to 
PEP1499 

pEPlSOO to 
pEP1599 



pEP1200 series plasmids with HIV 353-369 
substituted for Right Synmetrized Targets. 

PEP1400 series plasmids containg 

modified rav ^ genes producing Ravj^ proteins 

that complement the rav ]^* VF55 mutation. 



pEP1600 to 
PEP1699 



pEP1400 series plasmids containg 

modified rav R genes producing R&V]^ proteins 

that complement the rav ^^* FW58 mutation. 



PEP2000 



pEP1009 with rav replaced by arc 



p£P2001 



PEP2000 with operator in Pn£fi/ t:et 



PEP2002 



pEP2001 with operator in P^JSSl, aalT.K 
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03 fc< 



PEP2003 



p£P2004 



vgi-pr;P2O05 



PEP2002 with Targetfl in Pneo . tet , 
PEP2003 with Targetfl in Pasi^/ QS^ULZ 

p£F2004 with vgDNA (variegation #1 of 
polypeptide) « 



vg2-pEP2006 



PEP2004 with vgDHA (variegation 12 of 
polypeptide) • 



vg3-pEP2007 



p£P2004 with vgDKA (variegation I 3 of 
polypeptide) • 



PEP2010 



PEP2011 



vgl-pEP2012 



PEP2002 with Target #2 in Pnsfif tet > 

PEP2010 with Target#2 in Pamc, qalT,K 

pEP20ll with vgDNA (variegation #1 of 
residues 1*10) • 



PEP3000 



p£P4000 



PEP4001 



PEP4002 



vgl-pEP1233 



PEP2004 with Ct2->arefl-10^ in place of aXS 

PEP2002 with TargetfS in ^n&Slt 

PEP4000 with Target|3 in PAfikfi/ QalT,K, 

pEP4001 with ero*>hl2 in place of A2:£. 

pEP4002 with vgDNA (variegation #1 of 
polypeptide segment) • 
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Table 104: Xd t rninator 
and multiple cl ning site 
to Insert after ga^T . K 



5« I CgA i AAGig OT {ggrrjTTTlSCAj sec! TTTiT^ 

01ig#2 - 3' t ttc coa a?a a^a cat caa aaa aaa aaal 

J fd terminator L 



|AOTlAGTlCAG (TGGlCCClGAClTGGlCCGlTTAlAC 3' « Olig#l 
}taa tca|qtc acc aaa eta acc aac aat tag c 5» 

I spei t J sfii 1 1 np^i I 
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Table 105: Mutag nic Primer 
to Remove Bam HI site from pEPlOOS 



|t|plv|llwli| 
I 93| 94| 95| 96| 97 | 98 1 

5* CCiACA|CCC|GTC|CTG|TGG|ATC|- 

3 • tat ggg c; 



tliy|a|g|r|i| 

I 991100 1 1011102 1 103 1 104 1 

1CTG|TAC|GCC|GGA|CGC|ATC|GT 3» pEPlOOS 

Aac ata egg cct gca tag ca 5» 01ig#7 



Bold, upper case bases indicate sites of mutation 



Table 106: 
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Table 107: Synthesis of laEHSS-afitEIl-aalll- 

Kpnl- troa terminator 



51 I OTA I GTC I CAG ( GCTl TTA I CAC I TTT i ATG I CTTI CCG I GCT I - 

01igi9 = 3' aa ate ca a aat ata aaa tac aaa aac cqa- 
I Spel I i "3? I 



/3« - 01ig#8 

|CGTlATAlATG)TGTlGGA ATT|GTG { AGC ) GGA I TAA I C AA I TTT I 
gta tot tac aca c et taa cac tcg ggt att qt^ aaa- 



-19 I 



i lac operator 

t 

Olig 111 - 3'/ 



i 



/3« -Olig #10 



I CAC I ACA I C CT I AAC I CACGACAGA TCT A 



TGC|GGT fACCl- 



ertg tcft eea tta ot eetetct aaa t flgq Cgfl 433" 

I BstEil t i Bqllll I KPnI. I 

Olig #13 > 



3V 



}^ ^,T|<^^l^GC|CC G|CCT|AATlGAG!CGGiGCTlTTTlTTTlTI- 

f ft> a.it too age aaa tta etc occ caa aaa aaa aa- 

I spacer I troA terminator L 

G|GCC|CGAlC 3' = Olig #12 

c egg q 5' 

I Sfil 1 

Table 108: deleted. 



102 



EP0 452 413B1 



Taible 109: Synthesis of 
First s ga nt of rav gene 



Cl AGS I KGG \ TAA j CCA } gga I ggg i 
iBgtEII i 



|ATGtGAA|CAA[CGClATAtACC|CTA|AAG |GAC|TAC|GCG|ATG(CGCl- 



/3' « 01ig#14 

|TTT|GGC|CA A|ACC|AAG|ACAlGCG|AAG | G AC 1 CTA | 
01ig#15 » 2* aa ttc tcrt coc ttc cto aat> 

I Ball I 1 Ppviwi I 

|GGG|GTG|TAT|CAG|AGC|GCG|ATT|AAC|AAG|GCC|ATC| 
ccc cac ata ate tea cac taa tto ttc egg tag- 



I CAT I G CC I GGC I CG A I AAG I ATC I TTC I CTG I 
ata egg c ca act ttc tag aaa aac 5' 

iBqlll I 
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Table 110: Second segnent of rav gene 

|r|kli|fll|t|i|n|a|d| 

I 38| 39| 40| 41| 42| 43| 44 | 45| 46| 47 | 

C|CGAjAAaiATCiTTC|CTA|ACC|ATT|AAC|GCT|GAT| 

I Balll! 



|glslv|y|a|e|e|v|k|p|f|p|s 

I 48| 49| 50| 51| 52| 53 | 54 | 55| 56| 57 | 58 | 59 | 60 
I GGA I TCC I GTC I TAG I GCG I GAA I GAG I GTA I AAG I CCC I TTC I CCG I AGT 
I BanHI I |s1:rand ov«f1«p[ | AvaT | 

|n|k|k|t|t|a|.|.|.| 

I 61| 62| 63| 64| 65| 66| 67 | 67| 68| 

|AAC|AAAiAAA|ACA|ACA|6CG|TAA|TAG|TAG|g1:aicca|gtc|t 3 

I Kpnl I 
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Table 111 : X core sequences 
Used to search HIV-1 



Kin £t al. Consensus-A 5 



Symmetric Consensus-A 5 

3 



Or3A 



5 
3 



0|^3 A/Symm • Consensus . 6 5 

3 

0}^3A/Symm. Consensus. 5 5 

3 



1234567 
CCGCGGG 3 

'^^CGCCC 5 



CCGCGGG 3 
GGCGGCC 5 

CCGCAAG 3 
GGCGTTC 5 

CCGCAGG 3 
GGCGTCC 5 

CCGCCAG 3 
GGCGGTC 5 
7654321 



Kim Consensuses 



Syrnm. Consensus-S 



Or3S 



Op3S/Symm. Cons. 2 



Op3S/Symm . Cons . 3 
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Table 112: Potential target binding sequences 
having subsequenc s matching 
six of seven bases 



I 

CCGCGGG Kim consensus-A 
HIV-1 siibsequence «ACTTTCCGCtGGGGACT 

353 t 

I 

CCGCAGG 0{(3A/ consensus. 6 
HIV-l subsequence -TCTCGj^CGCAGGACTCG 

681 t 

I 

CTTGCGG 0^23 
HIV-l subsequence "TTTGACT&GCGGAGGCT 

760 t 
Table 113: Potential target binding sequences
having subsequences matching five of seven bases

Symmetric consensus-S           CCGGCGG
HIV-1 subsequence         GACTTTCCGctGGGGAC
352 ↑

OR3S/consensus.2               CCTGCGG
HIV-1 subsequence         TTTCCgCTGgGGACTTT
355 ↑

OR3S/symm consensus.3          CTGGCGG
HIV-1 subsequence         TAGCAgTGGCGcCCGAA
630 ↑

Symmetric consensus-A          CCGCCGG
HIV-1 subsequence         CAGTGgCGCCcGAACAG
633 ↑

OR3A/symm consensus.5          CCGCCAG
HIV-1 subsequence         CAGTGgCGCCcGAACAG
633 ↑

OR3A/consensus.6               CCGCAGG
HIV-1 subsequence         GACTAgCGgAGGCTAGA
763 ↑

Symm consensus-S               CCGGCGG
HIV-1 subsequence         GACTAgCGGaGGCTAGA
763 ↑
Table 113, continued: Potential target binding sequences
having subsequences matching five of seven bases

OR3A/symm consensus.5          CCGCCAG
HIV-1 subsequence         GAAGAtgGCCAGTAAAA

OR3A/consensus.6               CCGCAGG
HIV-1 subsequence         ACAGAtgGCAGGTGATG
5047 ↑

OR3A/consensus.6               CCGCAGG
HIV-1 subsequence         TCCTAtgGCAGGAAGAA
5965 ↑
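
The searches summarized in Tables 112 and 113 look for seven-base windows of HIV-1 sequence that agree with a core sequence at six, or at five, of the seven positions, reading either strand. A minimal sketch of such a scan is given below, assuming plain A/C/G/T input; the function names are hypothetical and are introduced here only for illustration.

```python
# Sketch: scan a sequence for windows matching a 7-base core on either strand.
# Assumption: sequences contain only A, C, G, T (upper or lower case).

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a plain DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def count_matches(core: str, window: str) -> int:
    """Number of positions at which core and window carry the same base."""
    return sum(a == b for a, b in zip(core, window))


def scan_both_strands(sequence: str, core: str, min_matches: int):
    """Return (strand, start, window, matches) for every qualifying window.

    strand is "+" when the core itself is compared with the sequence and "-"
    when the reverse complement of the core is compared (equivalent to
    finding the core on the opposite strand). start is 0-based.
    """
    seq = sequence.upper()
    hits = []
    for strand, probe in (("+", core.upper()), ("-", reverse_complement(core))):
        for start in range(len(seq) - len(probe) + 1):
            window = seq[start:start + len(probe)]
            matches = count_matches(probe, window)
            if matches >= min_matches:
                hits.append((strand, start, window, matches))
    return hits


if __name__ == "__main__":
    # Third entry of Table 112: OR3A core CCGCAAG read on the opposite strand.
    for hit in scan_both_strands("TTTGACTaGCGGAGGCT", "CCGCAAG", 6):
        print(hit)
```

Scanning with the reverse complement of the core is what relates the "-A" and "-S" forms of each core sequence in Table 111.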
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Table 114: Coding region of rav i_-27 gene 
meqritlkdyamr

1 2 3 4 5 6 7 8 9 10 11 12 13

ATG GAa CaA CGC ATa aCC cTa aaG GaC TaC GCG ATG CGC

fgRtktakdl 
14 15 16 17 18 19 20 21 22 23 
TTT GGC CGT ACC AAG ACA GCG AAG GAC CTA 

| PpuM I |

gvHITaiQNai 
24 25 26 27 28 29 30 31 32 33 34 
GGG GTG CAT ATT ACG GCG ATT CAG AAT GCC ATC 

hagKQifltinad 

35 36 37 38 39 40 41 42 43 44 45 46 47 
CAT GCC GGC AAG CAG ATC TTC CTA ACC ATT AAC GCT GAT 



g s V y a e e 

48 49 50 51 52 53 54 

GGA TCC GTC TAC GCG GAA GAG 

[BamHI]



V k p f p s 
55 56 57 58 59 60 
GTA AAG CCC TTC CCG AGT 

| Ava I |



n k k t t a . 

61 62 63 64 65 66 67 67 68 
AAC AAA AAA ACA ACA GCG TAA TAG TA GGTACC 

| Kpn I |
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Table 115: gene 



meqritlkdyamr 

1 2 3 4 5 6 7 8 9 10 11 12 13 
ATG GAA CAA CGC ATA ACC CTA AAG GAC TAC GCG ATG CGC 

fgEtktakdl 
14 15 16 17 18 19 20 21 22 23 
TTT GGC GAG ACC AAG ACA GCG AAG GAC CTA 



gvRTL aiRDai 
24 25 26 27 28 29 30 31 32 33 34 
GGG GTG CGT ACT CTT GCG ATT CGT GAT GCC ATC 

KagNHifltinad

35 36 37 38 39 40 41 42 43 44 45 46 47 
AAG GCC GGC AAT CAT ATC TTC CTA ACC ATT AAC GCT GAT 



gsvyaeevkpfps 

48 49 50 51 52 53 54 55 56 57 58 59 60 
GGA TCC GTC TAC GCG GAA GAG GTA AAG CCC TTC CCG AGT 



[BamHI]



n k K t t a . 
61 62 63 64 65 66 67 67 68 
AAC AAA AAA ACA ACA GCG TAA TAG TA GGTACC 

| Kpn I |






EP 0 452 413 B1



Tables for Example 2 
[0406] 

Table 200

P22 arc operator

5' ATGATAGAAG|C|ACTCTACTAT 3'
3' TACTATCTTC|G|TGAGATGATA 5'

consensus of half-sites

5' ATrrTAGArk|s|myTCTAyyAT 3'
3' TAyyATCTym|s|krAGATrrTA 5'

P22 arc left half operator  = ATrrTAGArk
P22 arc right half operator = myTCTAyyAT
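
The two half operators of Table 200 are related by the approximate two-fold symmetry of the operator: the right half is the reverse complement of the left half when the ambiguity codes are complemented as well (r with y, k with m). The following is a minimal sketch of that check; the lookup table and function name are assumptions made for illustration.

```python
# Sketch: reverse-complement a site written with the ambiguity codes of Table 200.
# Upper-case letters are fixed bases; lower-case letters are ambiguity codes.

COMPLEMENT = {
    "A": "T", "T": "A", "C": "G", "G": "C",
    "r": "y", "y": "r",   # r = A or G  <->  y = T or C
    "k": "m", "m": "k",   # k = T or G  <->  m = A or C
    "s": "s", "w": "w",   # s = C or G and w = A or T are self-complementary
    "n": "n",
}


def reverse_complement(site: str) -> str:
    """Complement every symbol and reverse the order (5'->3' of other strand)."""
    return "".join(COMPLEMENT[symbol] for symbol in reversed(site))


LEFT_HALF = "ATrrTAGArk"
RIGHT_HALF = "myTCTAyyAT"

if __name__ == "__main__":
    print(reverse_complement(LEFT_HALF))
    print(reverse_complement(LEFT_HALF) == RIGHT_HALF)
```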






Table 201 P22 Arc gene 



          | m | k | g | m | s | k |
          | 1 | 2 | 3 | 4 | 5 | 6 |
GG|TAA|CCT|ATG|AAG|GGT|ATG|TCT|AAA|-
|BstE II|

| m | p | h | f | n | l | r | w | p | r |
| 7 | 8 | 9 | 10| 11| 12| 13| 14| 15| 16|
|ATG|CCT|CAC|TTT|AAC|CTC|AGG|TGG|CCC|CGG|G
                  | Bsu36 I |    | Xma I |

| e | v | l | d | l | v | r | k | v | a |
| 17| 18| 19| 20| 21| 22| 23| 24| 25| 26|
| AG|GTC|CTT|GAT|CTT|GTT|CGC|AAG|GTT|GCT|-

| e | e | n | g | r | s | v | n | s | e |
| 27| 28| 29| 30| 31| 32| 33| 34| 35| 36|
|GAG|GAA|AAC|GGT|CGG|TCC|GTT|AAC|TCT|G  |-
                 | Rsr II|
Table 201, continued

| i | y | n | r | v | m | e | s | f | k |
| 37| 38| 39| 40| 41| 42| 43| 44| 45| 46|
| ag|atc|tat|aat|cgc|gtt|atg|gag|tcg|ttc|aag|-
  |Bgl II|

| k | e | g | r | i | g | a | . | . | . |
| 47| 48| 49| 50| 51| 52| 53|   |   |   |
|AAA|GAG|GGT|CGT|ATC|GGC|GCA|TAA|TAG|TGA|

|GGT|ACC|
| Kpn I |

Amino acid sequence encoded is identical to wild type P22 
Arc. 

DNA sequence designed for optimal placement of restriction 
sites. 
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Table 202 Synthesis of P22 Arc gene 

5'- G|TAA|CCT|ATG |AAG|GGTlATGlTCTlAAAl- 

3*- aa tac tt c cca tac aaa ttt- 
|BstE III 

/«olig#400 

{;VTG|gCT|CA g|TTT|AAC|CTC| ftgff | Tgg I CCC I CGC | " 

tac gga ata aaa t ^Q aaa tec acc aaa qcc 3'« 

I BSU36I I olig#405 

|GAG|GTC((rra | GAT|CTT |GTT|CGC|AAGlGTTlGCTl-- 
etc caa aaa eta aaa caa acq ttc caa caa^ 

/-olig#401 

|GAG|GAA | AAC|SgT|C<?g|TCg|g TTlAAClTCTlgAgl- 

etc ctt tta cc a acc aaa c aa tta aaa etc- 

\«olig#406 

/«ollg#402 

|ATC|TAT|AA T|gGg|GTT|ATG| gAg|TCg|TTC| hh^i^ 

tag ata tt a acq caa tac CtC flqg flM ttS* 

\-olig#407 

|AAA|CAG|G GT|CGT|ATC|GGClGCAlTAAlTAGlTgAl- 
ttt etc cca qca tag c ca cct att ate act- 

|GGT|AC 3' - Olig#403 
£ 5' - Olig#408 

I Kpn I I 

Number of bases in each oligonucleotide.

400 = 43      401 = 48      402 = 42

403 = 47      405 = 50      406 = 49

407 = 38      408 = 34






Table 203: HIV-1 Subsequences
that are similar to one half of
the Arc Operator

                                              Number of
                                              mismatches

                     1234567890|0987654321
arcO                 ATrrTAGArk
HIV-1 subsequence    ATtATAtAATACAGTAGCAAC        2
                     1019 ↑

                     1234567890|0987654321
arcO                 ATrrTAGArk
HIV-1 subsequence    ATAATAcAGTAGCAACCCTCT        1
                     1024 ↑

                     1234567890|0987654321
arcO                            myTCTAyyAT
HIV-1 subsequence    ACAGTAGCAACCCTCTATTgT        1
                                1040 ↑

                     1234567890|0987654321
arcO                 ATrrTAGArk
HIV-1 subsequence    ATGATAGgGGGAATTGGAGGT        1
                     2387 ↑

                     1234567890|0987654321
arcO                 ATrrTAGArk
HIV-1 subsequence    tTGAcAGAAGAAAAAATAAAA        2
                     2624 ↑
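
The mismatch counts in Table 203 follow from comparing each HIV-1 window with the half-operator consensus, treating each ambiguity code as the set of bases it permits. A minimal sketch of that comparison is shown below; the dictionary and the function names `mismatches` and `scan` are hypothetical and serve only as an illustration.

```python
# Sketch: count mismatches between a consensus written with ambiguity codes
# and a DNA window, and scan a longer sequence for near matches.
# Upper-case pattern symbols are fixed bases; lower-case are ambiguity codes.

ALLOWED = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "r": "AG", "y": "TC", "k": "TG", "m": "AC",
    "s": "CG", "w": "AT", "n": "ACGT",
}


def mismatches(pattern: str, window: str) -> int:
    """Number of positions where the window base is not permitted by the pattern."""
    return sum(base.upper() not in ALLOWED[code]
               for code, base in zip(pattern, window))


def scan(sequence: str, pattern: str, max_mismatches: int):
    """Return (start, window, mismatches) for each window within the limit.

    start is a 0-based offset into the supplied sequence.
    """
    seq = sequence.upper()
    hits = []
    for start in range(len(seq) - len(pattern) + 1):
        window = seq[start:start + len(pattern)]
        n = mismatches(pattern, window)
        if n <= max_mismatches:
            hits.append((start, window, n))
    return hits


if __name__ == "__main__":
    left_half = "ATrrTAGArk"
    for hit in scan("ATGATAGgGGGAATTGGAGGT", left_half, 2):
        print(hit)
```

In this sketch the case of the sequence letters is ignored, so the lower-case letters that mark mismatches in the table do not affect the count.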
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Table 204 Synthesis of Potential DBP-1 

vgl for pEP2004 

M K G M S K 
1 2 3 4 5 6 

g t .qt^rCTArGR I TAA I CCT I ATG { AAG I GCT I XTC I TCT I AAA t - 



2 2 2 2 2 2 
M P Q F |I/M|Q/R|D/V|R/I|W/G|D/C| 
7 8 9 10 1 11| 12 1 13 1 14 1 15 1 16 1 

P I ATB I CrG fGwT I AkA I kGG I GrTi 



|-3» » olig#420 
22222 |2 22 

|Q/L|R/T|F/Y|R/C|W/G| V \ Q | I/M| T/I | R/Q | 
I 17| 18| 19| 20| 21| 22| 23| 24 | 25| 26i 
|CwG|AsA|TwT|vGT|kGG|G TG|CAG|ATs|AvC|CrGl 

3' -cc cac gtc taS tRo aYc- 



22 22222222 
I V/I I R/I I F/V 1 D/V I T/I I R/Q | V/I ( D/G | V/I [ P/Q | 
I 27j 28| 29| 30j 31) 32| 33 | 34| 35| 36) 

I rTT I AkA I TwT I GwT I Aye I CrG I rTT I GrT I rTT I CmG 1 
Yaa tMt AWa eWa tR o qYc Yaa cVa Yaa aKc- 



• ■ • 

I TAA I TAG I TGA | AAC | CTC | AGG | CGTGATCC 

ate ac t tta aaa tee aeaetagq -5'»olig#421 

I BP^?^T \ spacer I 
Table 204, continued: NOTES

s = equimolar C and G          r = equimolar A and G

w = equimolar A and T          k = equimolar T and G

y = equimolar T and C          m = equimolar A and C

n = equimolar A, C, G, and T

There are 2^24 = (approx.) 1.6 x 10^7 DNA and protein sequences.

Number of bases in each oligonucleotide:

420 = 86      421 = 73
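
The notation of Table 204 can be made concrete by expanding a variegated codon into the codons, and hence the amino acids, that it allows. The sketch below assumes the standard genetic code and the ambiguity codes listed in the notes above; the function names are hypothetical. It also reproduces the count of 2^24 sequences for 24 residues each varied through two alternatives.

```python
# Sketch: expand a variegated codon such as "AkA" into the amino acids it encodes.
# Assumptions: standard genetic code; upper-case letters are fixed bases and
# lower-case letters are the equimolar mixtures defined in the notes.
from itertools import product

MIXTURES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "s": "CG", "r": "AG", "w": "AT", "k": "TG",
    "y": "TC", "m": "AC", "n": "ACGT",
}

BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def expand(codon: str):
    """All concrete codons represented by a variegated codon."""
    return ["".join(p) for p in product(*(MIXTURES[ch] for ch in codon))]


def amino_acids(codon: str):
    """Sorted one-letter codes of the amino acids encoded ('*' = stop)."""
    return sorted({CODON_TABLE[c] for c in expand(codon)})


def library_size(n_positions: int, alternatives: int = 2) -> int:
    """Number of distinct sequences when each position has `alternatives` choices."""
    return alternatives ** n_positions


if __name__ == "__main__":
    for codon in ("ATs", "CrG", "AkA", "kGG"):
        print(codon, expand(codon), amino_acids(codon))
    print(library_size(24))
```

Because each variegated codon here allows exactly two codons encoding two different amino acids, the numbers of DNA and protein sequences coincide.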
Table 205: Result of first variegation

  M   K   G   M   S   K
  1   2   3   4   5   6
|ATG|AAG|GGT|ATG|TCT|AAA|-

  M   P   Q   F   M   R   D   I   W   G
  7   8   9  10  11  12  13  14  15  16
|ATG|CCT|CAC|TTT|ATG|CGG|GAT|ATA|TGG|GGT|-

  Q   T   Y   C   G   V   Q   M   T   R
 17  18  19  20  21  22  23  24  25  26
|CAG|ACA|TAT|TGT|GGG|GTG|CAG|ATG|ACC|CGG|-

  V   I   F   D   I   R   V   G   V   P
 27  28  29  30  31  32  33  34  35  36
|GTT|ATA|TTT|GAT|ATC|CGC|GTT|GGT|GTT|CCG|
Table 206 Synthesis of Potential DBP-1

vg2 for pEP2004 

M K G M S K 



R'-GCgGTACGGlTAAlCCTjXTClAASlGGTlATGlTCTlAAAl- 

M P Q F M|T R|Q D|H I|T W|R G|C 
7 8 9 10 11 12 13 14 15 16 
|ATG|CCT)CAC[TTT|AvG |CrA{rAT|AvTlvGGlkGTl- 



= 3* Olig#424 

I 

4 vlD 

Q|H T|M Y C G M|I Q|R MJT T(M R|C 
17 18 19 20 21 22 23 24 25 26 
I CAW I AaC| TA e| TGC| GGG | rwT) CrG I AvGl AvCl VGT I - 
3* - Q ato acq eec YWa aVe tUc tRa Rca - 
I overlap I 

V|P R|Q F|S DlN I|T R|I V|G G|R V|D P|R 
27 28 29 30 31 32 33 34 35 36 

I kTT I CrG I TyT I rAT j Aye I AkA I GkT I SGT I GWT I CsG I 
Maa aYc aRa Vfca tRa tMt cMa Sea cWa aSc - 

TAA I TAG I TGA I AAC I CTC I AGG I CGTGATCC 
atil: ate a nr ^fca oaa t.cc acactaaq -5»-olig#423 

I BSU36I I spacer I 
Table 206, continued: NOTES

s = equimolar C and G          r = equimolar A and G

w = equimolar A and T          k = equimolar T and G

y = equimolar T and C          m = equimolar A and C

n = equimolar A, C, G, and T

2^24 sequences = 1.6 x 10^7 sequences (DNA and protein)

Number of bases in each oligonucleotide.
424 = 78      423 = 81
Table 207 Result of second selection

  M   K   G   M   S   K
  1   2   3   4   5   6
|ATG|AAG|GGT|ATG|TCT|AAA|

  M   P   Q   F   M   R   N   I   W   G
  7   8   9  10  11  12  13  14  15  16
|ATG|CCT|CAC|TTT|ATG|CGA|AAT|ATT|TGG|GGT|-

  Q   T   Y   C   G   D   R   M   T   R
 17  18  19  20  21  22  23  24  25  26
|CAT|ACC|TAC|TGC|GGG|GAT|CGG|ATG|ACC|CGT|

  F   N   S   N   I   R   G   R   V   R
 27  28  29  30  31  32  33  34  35  36
|TTT|AAT|TCT|AAT|ATC|AGA|GGT|CGT|GTT|CGG|

|TAA|TAG|TGA|
Table 208: Third variegation vg3 for pEP2004

M K G M S K 
1 2 3 4 5 6 
5<- fj^qTCGCATGG 1 TAA [ CCT i ATG I AAG I GGT i ATG i TCT I AAA I - 
j spacer I BstE II { 

H|K N|D W|S 

M P Q F E|V R T|A I R|P G 
7 8 9 10 11 12 13 14 15 16 
|ATG|CCTi gAClTTTlrvGtCGGinnTlATAlvsGlGGTl- 

QjR Y|C D|I R|Q 

G|E T H|R C G N|V G|E M T R 

17 18 19 20 21 22 23 24 25 26 
lsrG|ACA|vrT|TGTlGG G|rvTlsrG|ATG|ACC[CGCl-oliqi325 

olig#327 3'- g tag tqq qgq* 

I overlap 1 



F|C S|N I]T G|D R|H 

VlG N R|H N P|L R R|H R V PtL 
27 28 29 30 31 32 33 34 35 36 
|kkTlAAT|mrTlAAT|iiiyClCGG|firT|CGT|GTT|CnT| 

MMa tta K Ya e-ta KRq atie SYa aca caa aNa- 



TAA I TAG I TGA | AAC | CTC | AGG | CGACCTGGC 
a^ti ate acH tta qao tCC qctqqaccq -5* 



s = equimolar C and G          r = equimolar A and G

w = equimolar A and T          k = equimolar T and G

y = equimolar T and C          m = equimolar A and C

n = equimolar A, C, G, and T

4^12 = (approx.) 1.6 x 10^7 protein and DNA sequences
Table 209: Polypeptide that
Binds First Target

  M   K   G   M   S   K
  1   2   3   4   5   6
|ATG|AAG|GGT|ATG|TCT|AAA|-

  M   P   Q   F   V   R   D   I   R   G
  7   8   9  10  11  12  13  14  15  16
|ATG|CCT|CAC|TTT|GTG|CGG|GAT|ATA|CGG|GGT|-

  G   T   H   C   G   I   Q   M   T   R
 17  18  19  20  21  22  23  24  25  26
|GGG|ACA|CAT|TGT|GGG|ATT|CAG|ATG|ACC|CGC|

  V   N   R   N   P   R   H   R   V   L
 27  28  29  30  31  32  33  34  35  36
|ATT|AAT|CGT|AAT|CCC|CGG|CAT|CGT|GTT|CTT|
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Table 210; variegation for Second Target vgl for pEP2011 



5'- f?r;TgcgATGG I TAA ) CCT I ATG I rvG I rnG I GnT I rvG t Ars I BIAS I - 



M|K P|Q Q|H I|M V|I 

Q|L R|I. N|A V|D T|A R D I R G 

7 8 9 10 11 12 13 14 15 16 
|iiivG|CnT|aA «|nwT|rvTlCGGtGATlATAlCGGlGGTl- 



GTHCGll QMTR 
17 18 19 20 21 22 1 23 24 25 26 
|fiGC|ACA|eA e)TGe|CGGlATcl CAG | ATG | ACC | C6C { 
olig#461 - -I'-g^a acq ece tag qtC tflC tqg acq- 



VNRNPRHRVL 
27 28 29 30 31 32 33 34 35 36 
I ATT I AAT I CGT I AAT I CCC I CGC I CAT I CGT I GTT I CTT I 
taa tta aca tta crag acc crta oca caa qaa 



TAA|TAG|TGA(AAC|CTC|AGG|CGACCTGGC -3* 
«t:t ate a nH t:ta aaa fcee gefcggaccg -5' 

I aau36I I spacer I 



K|T 
A|V 



M|T 6jD G\0 H|T SjR KjQ 
M ViA R|M AiV V|A M|K NjH 
0 1 2 3 4 5 6 




F|Y 



/ - olig«460 



I overlap 
Table 210, continued: NOTES

s = equimolar C and G          r = equimolar A and G
w = equimolar A and T          k = equimolar T and G
y = equimolar T and C          m = equimolar A and C
n = equimolar A, C, G, and T

2^24 = 1.6 x 10^7 protein and DNA sequences



Table 211: Polypeptide Selected 
for Binding to Second Target 



M T D M K Q

  0   1   2   3   4   5   6
|ATG|ACG|AGG|GAT|ATG|AAG|CAG|-

  M   Q   N   D   I   R   D   I   R   G
  7   8   9  10  11  12  13  14  15  16
|ATG|CAT|AAC|GAT|ATT|CGG|GAT|ATA|CGG|GGT|-

  G   T   H   C   G   I   Q   M   T   R
 17  18  19  20  21  22  23  24  25  26
|GGG|ACA|CAC|TGc|GGG|ATc|CAG|ATG|ACC|CGC|

  V   N   R   N   P   R   H   R   V   L
 27  28  29  30  31  32  33  34  35  36
|ATT|AAT|CGT|AAT|CGC|CGG|CAT|CGT|GTT|CTT|
Tables for Example 3
Table 300: CI2-arc(1-10) gene

          | m | l | k | t | e | w |
          | 1 | 2 | 3 | 4 | 5 | 6 |
GG|TAA|CCT|ATG|CTT|AAG|ACT|GAA|TGG|

| p | e | l | v | g | k | s | v | e | e |
| 7 | 8 | 9 | 10| 11| 12| 13| 14| 15| 16|
|CCT|GAG|CTT|GTT|GGT|AAA|TCT|GTC|GAG|GAA|

| a | k | k | v | i | l | q | d | k | p | e |
| 17| 18| 19| 20| 21| 22| 23| 24| 25| 26| 27|
|GCT|AAG|AAA|GTT|ATC|CTG|CAG|GAT|AAA|CCT|GAG|
                     |Pst I|

| a | q | i | i | v | l | p | v | g |
| 28| 29| 30| 31| 32| 33| 34| 35| 36|
|GCC|CAA|ATC|ATA|GTA|CTT|CCG|GTT|GGC|

| t | i | v | t | m | e | y | r | i | d |
| 37| 38| 39| 40| 41| 42| 43| 44| 45| 46|
|ACT|ATT|GTT|ACC|ATG|GAG|TAT|CGT|ATT|GAC|
              |Nco I |
              |Sty I |

| r | v | r | l | f | v | d | k | l | d |
| 47| 48| 49| 50| 51| 52| 53| 54| 55| 56|
|CGC|GTT|CGT|CTT|TTT|GTC|GAC|AAA|TTG|GAT|
                     |Acc I|
                    |Hinc II|
                     |Sal I|
Table 300, continued

| n | i | a | e | v | p | r | v | g | g |
| 57| 58| 59| 60| 61| 62| 63| 64| 65| 66|
|AAC|ATT|GCT|GAG|GTC|CCT|CGC|GTA|GGT|GGC|
              |PpuM I |

| k | m | k | g | m | s | k | m | p | q |
| 67| 68| 69| 70| 71| 72| 73| 74| 75| 76|
|AAA|ATG|AAA|GGT|ATG|TCT|AAG|ATG|CCG|CAA|

| f | . | . | . |
| 77| 78| 79| 80|
|TTT|TAA|TGA|TAG|GGT|ACC|
                 |Kpn I|

Residue M1 is inserted so that translation can initiate.

Residue L2 corresponds to residue L20 of barley chymotrypsin
inhibitor CI-2.

Residues G66 and K67 are inserted to allow flexibility
between CI-2 and the DNA-binding tail.

Residues 68-77 have the same sequence as the first ten
residues of P22 Arc.
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Table 301: Synthesis of 
CI2-arc(l-10) gene 

|in|l|k|t|e|w| 
|1|2|3|4|5|6| 
q . - r. I T&A ! CCT ! A Tf: j OTT } AAG i ACT- i GAA 1 TGG i - 
3'- ga tae aaa ttc %aa ett ace - 
I BstEIll I ftfl I I 

^3' = olig#470 
|p|e|l|v|g|)c|8| v|e|e( 
I 7 I 8 I 9 I 101 111 12| 13| 14| 151 16| 
I COT I GAG I CTTjGTTlGGT I AAA I TCTI CTg I gftg I gftA | - 
qga etc aaa eaa c ea ttt aaa gftg gtC Clit -. 



lai ]c|klv|i|l|q|d|k|p|e| 
I 17| 18| 19| 20| 211 22| 23| 24| 25 | 26| 27 | 
|KCT| AAG|AAAlGTTl ATClCTGlCAG|CAT|AAAlCCTlGAGl- 

caa fctic titt c aa tag aac ate eta ttt qqa etc - 
5" ] I Pst I 1 J. 8?u??I| 

olig#475 

/3' = olig#471 
|a|q|i|i|v| Ijplvjg | 

I 28| 29| 30| 311 32| ^^l 3*1 ^Sj 36 1 

|Gfff;|CAAlAT r|ATAlGTAl CTT I CCS I CTXjgg SLU 

egg ott tag tat cat qaa qqg Cflft CS 3-=. 

I Sea I I t 5' 01ig#476 

|t|i|v|t|m|e|y|r|i|d| 

I 37| 38l 39| 40| 4ll 42| 43| 44 | 45| 46| 
I ACT I ATTI Gn I Arr I ATG | GAS | TAT | CGT | ATT I GAC I - 

toa taa eaa tag tae etc ata oca taa etc 

1 Nco H 5- olig#477 t 

I Stv II 
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Table 301, continued 



J 3' olig#472 
|r|v|r|l| f|v|d|k|l|d| 
I 4?! 48! so! 51 ! 521 53 1 54 j S5| 56 j 
! Pr.riGTTlCGTlCTTl TTt|GTClGAClAAAlTTGlGATl- 

aca caa oca aaa aaa caa eta ttt aac eta - 

I sal I I 



ln|i|a|e|v|p 

1 57 1 58! 59 1 60! 61 1 "1 
I ARC I ATT I err I r,AG I GTC I CCTI 



olig#473 3« ^ 

r I V I g I g I 

63! 64 1 65! 66 1 
CGC I GTA I GG T I GGC I 



<;fca taa eg* fite eao qqa 

I Pra Hit 

i PpuM I I 

I Avail I 



geg cat eea cea 
\ 3« olig#479 

5' olig#478 



|k|D|k|glDis|X|D|p!q| 
I 67} 68| 69| 70| 7l| 72! 73| 74 | 75| 76| 
|AAA|ATGlAA A|GGT|ATGlTCTlAAGlATGlCCGlCAAl- 
ttt tae ttt eca tae aaa tte tac qqc qtt. - 



1 f I . I . I • 1 
I 77 1 78 1 79 1 so! 

|TTT|TAAlTGA|TAG|GGT|AC- 3'« olig#474 
aaa att a ct ate c - 5« 

I Kpn I I 
Table 301, continued

Number of bases in each oligonucleotide

olig#470    46          olig#475    53
olig#471    57          olig#476    56
olig#472    54          olig#477    31
olig#473    48          olig#478    48
olig#474    47          olig#479    55
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Table 302: Variegation of Tail on CI-2 vgl for pEP3000 

|e|v|p|r|v|g|g| 
5' I 60| 61| 62| 63| 64 | 65| 66| 

nn^ ate cftTe|GAGlGTC |rrT|rKC|GTAlGGTlGGCl 

I spacer iPpuM I I 

3' 

|k|m|)c|g|ia|s|k|B|p|q| 
I 67 1 68 1 69 1 vol 71] 72 1 73 1 74 1 75| 76 1 
| AftA|ATGl^ AA|GGT|ATGiARClAAGlATGlCCGlCftg|- 

I f |I/M|Q/R|D/V|R/G| G I V I 
77| 78| 79| 80| 8l| 82| 83 | 
| TTr|ATsl rw;|G«TleGAlGGTlGTCl- 
olig«481 3'- f^t eea eaa - 

/ 3* - Olig*480 

IQ/L |R/T|F/1f|R/C|W/G|V/DlQ/R|I/M|T/l|R/Q| 
I 84| 85| 86| 87| 88 | 89\ 90| 9ll 92 | 93| 
XS. wG|AsA|TwT|yGT|)cGG|GwC|CrG|ATslAyC|CrG|- 
r , ttst aWa Jtn i* mcc cWc aye taS tRq qYc- 

I V/I I R/I I F/Y I D/V I T/I I R/Q 1 V/I | D/G | V/I | P/Q | 
I 94| 95| 96| 97| 98| 99 j 100 HOl| 102 | 103 | 
I rrr I AkA I TwT I GwT I Aye I CrG I m I GrT I rTT I CnG 1 

Y^ifl 

I . I . I • I 

1 104 1 105 1 106 I 

I TAA I TCA I TAG I GGT I AC 
act ate c - 5* 

2^24 DNA and protein sequences = (approx.) 1.6 x 10^7.

Number of bases in each oligonucleotide.

480 ... 82      481 ... 78
Tables for Example 4
Table 400 Search for λ Cro Half Site in
HIV non-variable regions

Gene seq id«HIVHXB2CG 

Sequences sought: 

0|^3 -/ consensus 

Consensus 0|^3- hybrid 

forward 5» TAT£A£C 3' 5« TATfiCffT 3* 5' TAT£A£T 3' 
reverse 5' GSTfiAlA 3' 5" AfiGfiAIA 3' 5» AfiTfiAIA 3' 

Match with 0r3- 

matches «TAT£C£T 
HIV subsequence »aATCtCTAGCAGTGGCG 

624 f 

Match with OR3/consensus hybrid

matches -TATCAffC 
HIV subsequence -aATCtCTAGCAGTGGCG 

624 t 

Match with Consensus 

matches =sGfSTSAIA 
HIV subsequence -ACAGATGGCAGGTGATg 

5057 I 
Table 400, continued

Match with Consensus 

matches "TATfiASC 
HIV subsequence «cATCtCCTATGGCAGGA 



First target : TATfiCSTAGCAGTGGCG 

Second target: aATCtCTAGCAGTGGCG 

624 \ 



Table 401: λ Cro alpha 1 & 2 & slot for Polypeptide

|in|elq|r|i|t| 
I 1 I 2 1 3 I 4 I 5 1 6 I 
5' eaa|CGGlAG C|TAA|gOTlATGlGAAlCAAlCGClATAlACCl- 
I spacer | BstE It| 

oli9«483 3'^ 

|l|k|d|y|a|nlr|f|g|q| 

I 7 I 8 I 9 I 10 1 111 12 1 13 1 14 1 15 1 16) 
|OTA|AAG)GAC|TAelGCGlATGlCGClTTTlGGClCAAl 

go tac qcQ aaa cca crtt- 

|Bal I I 

|t|k|t|a|k|d|l|g|v| 

I 17| 18| 191 20| 21| 221 23| 24 | 25| 
1 ACC I AAG I ACA 1 GCC t AAA I G AT I CTC I GGG 1 GTG I 
tag ttc tat caa ttt eta aaa ccc cac- 

iBql Hi 

I Ava I| 

I . I • I • I 

I I I I 

I TAG I TAG I TAG 1 GGT | ACC j AAG | GCG | 
ate ate a te cca tag ttc cac - 5' olig#484 

I Knn I I sapcerj 

Number of bases in each oligonucleotide.

483 ... 60      484 ... 65
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Table 402 Variegated Polypeptide to attach to 
Cro Helices 1, 2, & 3 vgi for pEP4002 

k I d I 1 I g I V I 
21| 22| 23| 24| 25| 
cca I aca| acc | caA j GAT j CTC [ GGG \ GTS j - 

I ffPftg^r iBql III 

I Ava II 

ly|q|s|a|i|n|k|a|i|h| 
I 26t 27] 2S| 29| 30| 3l| 32] 33] 34 | 35] 
lTATlCAGlAGCiGCGlATTlAAClAAAiGCG|ATClCACl- 

liM Q|R D|V R|I W|G D|G Q|L R|T F|Y R|C 
I 36| 37| 38| 39| 40| 4l| 42| 43| 44 | 45| 
lATslCrGlGwTiAkAlkGGlGrT|CwGlAflA|TvT|vGT|- 

^ 3' - Olig«486 
W|G V Q I|M T|I RjQ V|l R|I F|Y DjV 
I 46] 47| 48] 49| 50| 51 j 52 | 53 j 54 | 55| 
| kGG|GTG | CAG | AT 3|AyC|CrG[rTT|AkA|TwT|GvT| 
cc cac aac taS yRo qYc Yaa tMt aWa cAa- 

T|l RjQ V|I D|G V|I P|Q 
I 56| 57] 58| 59| 60| 6l| 
I Ay C I CrG I rTT I GrT I rTT I CmG I 

tRg qY<p Yaa gYa Yaa qKg* 

I . I . I . I 

till 

|TAG|TAG|TAG|GGT|ACC|AAG|GC6| 
ate ate ate cca tag ttc coc 5' « olig#488 

I Kon I I sapcerj 



s = equimolar C and G          r = equimolar A and G
w = equimolar A and T          k = equimolar T and G
y = equimolar T and C          m = equimolar A and C
n = equimolar A, C, G, and T
Table 403
Result of first variegation of
alpha 1,2,3:vgPolyPeptide

| m | e | q | r | i | t | l | k | d | y | a | m |
| 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10| 11| 12|
|ATG|GAA|CAA|CGC|ATA|ACC|CTA|AAG|GAC|TAC|GCG|ATG|

| r | f | g | q | t | k | t | a | k | d | l | g | v |
| 13| 14| 15| 16| 17| 18| 19| 20| 21| 22| 23| 24| 25|
|CGC|TTT|GGC|CAA|ACC|AAG|ACA|GCC|AAG|GAC|CTA|GGC|GTG|

| y | q | s | a | i | n | k | a | i | h |
| 26| 27| 28| 29| 30| 31| 32| 33| 34| 35|
|TAT|CAG|AGC|GCG|ATT|AAC|AAA|GCG|ATC|CAC|

| M | Q | V | R | G | D | L | T | Y | C |
| 36| 37| 38| 39| 40| 41| 42| 43| 44| 45|
|ATG|CAG|GTT|AGA|GGG|GAT|CTG|ACA|TAT|TGT|

| W | V | Q | I | I | R | V | R | F | D |
| 46| 47| 48| 49| 50| 51| 52| 53| 54| 55|
|TGG|GTG|CAG|ATC|ATC|CGG|GTT|AGA|TTT|GAT|

| T | R | V | G | I | Q | . | . | . |
| 56| 57| 58| 59| 60| 61|   |   |   |
|ACC|CGG|GTT|GGT|ATT|CAG|TAG|TAG|TAG|
Tables for Example 5

Table 500: Proposed binding of Arc dimer to arcO

(a) Interaction of residues 1-10
with arcO

Arc          N         >C C<         N

arcO  5' ATrrTAGArksmyTCTAyyAT 3'
      3' TAyyATCTymskrAGATrrTA 5'



(b) N-terminal residues interacting with same 
polypeptide chain, dimer contacts near C--terminus 



2 C 0 1 

\ / 

/VVV\X/VVV\ 
/ \ 
VVV"\, i/'VVV 
Arc 1 N 1 I M 2 

arcO  5' ATrrTAGArksmyTCTAyyAT 3'

      3' TAyyATCTymskrAGATrrTA 5'
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Table 500, continued 



(c) N-terminal residues interacting with opposite
polypeptide chain, dimer contacts close to residue 10



Arc 



/vw' vvv\ 

, ,y.y-y.::?Nr.y-y-y-N . 

5' ATrrTAGArksmyTCTAyyAT 3'
3' TAyyATCTymskrAGATrrTA 5'
Table 501: Search of HIV-1 isolate HXB2 DNA sequence
for sequences related to one half of arcO

In arcO sequence, upper case letters represent palindromically
related bases.

In HIV-1 subsequences, § represents a nucleotide found to
vary among HIV-1 isolates while lower case letters represent
mismatch to arcO.

HIV-1 1016-1051 is non-variable.

arcO left half    = ATrrTAGArk
HIV-1 subsequence = §§ATCATTATATAATAcAGTAGCAACCCTCTATTGTGT§
1024 ↑

arcO right half   = myTCTAyyAT
HIV-1 subsequence = CAGTAGCAACCCTCTATTgTGT§
1040 ↑

2387-2427 is non-variable.

arcO left half    = ATrrTAGArk
HIV-1 subsequence = §ATGATAGgGGGAATTGGAGGTTTTATCAAAG
2387 ↑

4661-4695 is non-variable.

arcO left half    = ATrrTAGArk
HIV-1 subsequence = AAGTCAAGGAgTAGTAGAATCTATGAATAA§
4676 ↑
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Table 502: Progression of Targets Leading to 

HIV-1 1016-1037 

) 1016 1024 center 1047 

I I i ^ 

HIV-1 5' ATCXTTATATAATAcAGTAGCAACCCTCTATT 

v^^t- ^Mi^^t, 5* TAGATqATAgAaqcaCtAt aCTaT S g @ 

P22 sequence 5< »»^?aTftTqf'^^Q^'*°g^^^^<^^^*r^^^g^g^^^^ ^' 

3' ifcaaetqTActiATCTtcQrtaAGATqaTAtaaqaqttat 5' 

J SXSQ I 



In target: Upper case indicates that HIV-1 and arcO

Lower case indicates a change to match arcO.
Underscore indicates identity to arcO.
§ indicates bases that vary between instances
of target.

In arcO: underscore indicates DNase I protected.

lower case indicates not palindromically
related.
Table 502, continued
(b) 

Novel DBP = NXXXXX >C C< XXXXXN 

Pirsi: target = a a aTi^e ato ATAaAaacaCtAt aCTaT 9%% 

In the Novel DBP, X represents variegated sequence.
In the line between Novel DBP and target DNA:

represents regions where variegated sequence
produces amino acid sequences that will bind
specifically.

N & C are the amino and carboxy ends of residues 1-10.

| or \ represent regions where constant amino acid
sequence is known to bind DNA.

# represents regions where constant amino acid

sequence is believed not to bind DNA.

$ represents regions where DNA sequence varies

between different instances of the target.
Table 502: Progression of Targets Leading to

HIV-1 1016-1034 
(continued) 



(n\ 1016 1024 



io2A center 1047 



HIV-1 5* ATCATTATATAATACAGTAGCAACCCTCTATT 

First target 5* TftnATaATAaAaacaCtAtaClaT 

changes ir t t 

fiAnond target 5- TAfiaiAAl&CaGaSa£fcAtCfiI§e 

t 

P22 sequence 5> ''»^'7 a ff ftTff"'^^g*^°g»gTCTActATattctcaat;? 

31 t.ffl ^«<^aT'^^ct ATCTt:eataAGATaaTAtaaqaqttat 

J acs2 L 



Novel DBP 



N XXXXXXXX>C C<XXXXXXXX K 

Mil Ill I $$$$$$$$$$$$ 

P TAcATAATA CAGqcaCIAtCCie € 8 § § § 
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Table 502, continued 



(e> 



1016 1024 center 1047 
HIV-l 5' ATCATTATATAATACAGTAGCAACCCTCTATT 



Second target 5' e§eT2^A^C&GaSfl£^tC£I 



clianc^< 



III I 



t I ! J ! ! 



gATTATAT AATA CAGTAagAACCe « 



P22 sequence 5' ^t-^fT^gfTqaTAGAaacacTCTActATattCtcaata 

3> l^aaetaTAcl iATCTtcataAGATqaTAtaaqaqttat 

J arcs L 



(f) 



Novel DBP 



diffuse variegation 

NXXXXXXXXXXXXX>C C<XXXXXXXXXXXXXN 

I 1SS5$$$$$$S$$$S5S 

a a gAJT ATAJ AAT A CAGT AaCAA CC§§§e§§e§§§ 
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Table 502: Progression of Targets Leading to 

HIV-1 1016-1034 
(continued) 

(g) 1016 1024 center 1047 

HIV-l 5» ATCATTATATAATAcAGTAGCAACCCTCTATT 
Third target 5" i«CAlT&T&IAAIACAGTAflSAACC§§§eee§ 
changes ^ 4 4^ 

Fourth target 5' ATCAlTATMAAIftC&GTAGSA§§e 

P22 sequence 5* A»<-fT»>eATaaTAGAaacacTCTActATattctcaata 

3* t^^aaetaTAct ATgrtcdtoAGAToaTAtaaaaattat 

J argQ L 



Novel DBP 



diffuse variegation 

NXXXX >C C< XXXXN 

Illllllllll |||$$$$$$$$$$S$$S$$ 

ATCAiTAT&iAai&caGTAG£A §§§«§§ i § e § e 8 § 
Table 503: First Target Downstream
of Promoters of Selectable Genes

First Target downstream
of Pamp that promotes galT,K

Olig#501 » 3* gga cac t ta acc tta acq atc- 

I StUi 1 I -35 I 



|rTC|GGGlcr .r|rCT|CTG)GTAlAGGlTTGlGGAl- 
aae eec aca aaa a ae eat tec aac cct - 

I -10 I 

1024 

|TAC|ATG|ATA |n&A|GCAlCTAlTAClTATlA 3' - Olig#502 

ata tae t *t: ett eat oat ate ata t tea a 5' 
J ^irst Target | | Hlnd3 I 






Table 503, continued 

First Target downstream
of Pneo that promotes tet

5* |grT|CTA|AAT|ACA|TrclAAAl- 
Olig|503 3* g caa aaa aat tta tat aaa ttt- 

I Apal I J -3? I 



I TAT| GTA| TCC| GCT| CAT | GAG | ACA | ATA| ACC | CT | - 
ata cat aoa caa ata etc tat tat tag aa- 

I -IP I 

1024 

|TAC|ATG|AT A|GAA|GCA|CTA|TAC|TAT| CGT 3' - Olig«504 

ata tae tat ctt cot cat ata ata — qca qftt C 5' 
J First Taraet |. | Xbfll I 
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Table 504 First variegated 
insert into fisd 

1bIx|x|x|x|x1 

Sr I 1 I 96) 97| 98| 99ll00| 

j cct i cag i cGG i TAA i CCT i ATG i f zkj f zk I f zk| f zk I f 2k I 

|m|k|glin|s|k| 

1 101 1 102 1 103 1 104 1 105 1 106 ] 
|ATG|AAG1GGT1 ATGjTCTl AAA| 

|m|plhlf|nlllr| Center of symmetry for 

1 107 1 108 1 109 1 110 1 111 1 112 11131 | priming, 

I ATG I CCT 1 CAC I TTT I AAC I CTC 1 AGG I cgt I att I aat I acg 1 cct I g- 3 • 

I primer I 

olig#605 f 



Self priming 

CTC|AGG|cgtlatt|\ 

3«- tec gca taa / 

3' end self primes for extension with Klenow enzyme.

f = (0.26 T, 0.18 C, 0.26 A, 0.30 G)
z = (0.22 T, 0.16 C, 0.40 A, 0.22 G)
k = equimolar T and G

There are (2^5)^5 = 2^25 = (approx.) 3.4 x 10^7 different DNA sequences encoding
20^5 = 3.2 x 10^6 different protein sequences.

100 has been added to residue numbers for wild-type Arc.
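
The consequence of the non-equimolar mixtures f and z can be illustrated by computing the expected frequency of each amino acid at a codon synthesized as fzk. The sketch below assumes the standard genetic code and independent incorporation of each base at the stated fractions; the names `MIX`, `residue_distribution`, and `library_sizes` are hypothetical.

```python
# Sketch: expected amino-acid frequencies for a codon built from the base
# mixtures f (first base), z (second base) and k (third base) of Table 504.
# Assumptions: standard genetic code; bases incorporated independently.
from itertools import product

BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# Fractions of T, C, A, G in each mixture, as given in the notes to Table 504.
MIX = {
    "f": {"T": 0.26, "C": 0.18, "A": 0.26, "G": 0.30},
    "z": {"T": 0.22, "C": 0.16, "A": 0.40, "G": 0.22},
    "k": {"T": 0.50, "C": 0.00, "A": 0.00, "G": 0.50},
}


def residue_distribution(codon: str = "fzk"):
    """Map each encoded residue ('*' = stop) to its expected frequency."""
    dist = {}
    for triplet in product(BASES, repeat=3):
        p = 1.0
        for mixture, base in zip(codon, triplet):
            p *= MIX[mixture][base]
        if p > 0.0:
            residue = CODON_TABLE["".join(triplet)]
            dist[residue] = dist.get(residue, 0.0) + p
    return dist


def library_sizes(n_codons: int, codons_per_position: int = 32):
    """(number of DNA sequences, number of protein sequences) for n varied codons."""
    return codons_per_position ** n_codons, 20 ** n_codons


if __name__ == "__main__":
    for residue, freq in sorted(residue_distribution().items(),
                                key=lambda item: -item[1]):
        print(residue, round(freq, 4))
    print(library_sizes(5))
```

With T or G only at the third position there are 4 x 4 x 2 = 32 = 2^5 codons per varied residue, which is the origin of the (2^5)^5 count above; these codons cover all twenty amino acids and a single stop codon (TAG).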
Table 505: Protein Ped-6
selected for Binding to First Target

| m | k | d | i | w | r |
| 1 | 96| 97| 98| 99|100|
|ATG|AAG|GAT|ATT|TGG|CGT|

| m | k | g | m | s | k |
|101|102|103|104|105|106|
|ATG|AAG|GGT|ATG|TCT|AAA|

| m | p | h | f | n | l | r | w | p | r |
|107|108|109|110|111|112|113|114|115|116|
|ATG|CCT|CAC|TTT|AAC|CTC|AGG|TGG|CCC|CGG|G
                  | Bsu36 I |    | Xma I |

| e | v | l | d | l | v | r | k | v | a |
|117|118|119|120|121|122|123|124|125|126|
| AG|GTC|CTT|GAT|CTT|GTT|CGC|AAG|GTT|GCT|

| e | e | n | g | r | s | v | n | s | e |
|127|128|129|130|131|132|133|134|135|136|
|GAG|GAA|AAC|GGT|CGG|TCC|GTT|AAC|TCT|G  |
                 | Rsr II|
                        | Hpa I |

| i | y | n | r | v | m | e | s | f | k |
|137|138|139|140|141|142|143|144|145|146|
| AG|ATC|TAT|AAT|CGC|GTT|ATG|GAG|TCG|TTC|AAG|
  |Bgl II|

| k | e | g | r | i | g | a | . | . | . |
|147|148|149|150|151|152|153|   |   |   |
|AAA|GAG|GGT|CGT|ATC|GGC|GCA|TAA|TAG|TGA|

|GGT|ACC|
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Table 506: Second Target Downstream 
of Promoters of Selectable Genes 

Second Target downstream
of Pamp that promotes galT,K

5» I CCTI GCG I AAC I CCC j AAT { TGC { CAG } - 

Olig«541 - 3' gga cae tta aec tta acq crte- 

I stui I I -35 I 

I OTG I GGG I Q ^r I PCT I CTfl I GTA I AGg I TTG I GGA I - 
gae CCC g ees aoa gae eat tec aac CCt - 

I -19 I 

1024 

; 

| TAC|ATAlAT^|gAG|GCAl OTA|TCClTl A 3* - 01ig#542 

ata tat t? r. ate cot aat agg a — t tcqa 5' 
J g^nond Target I iJUjlfllL 



Second Target downstream
of Pneo that promotes tet

5' |rTT|CTA|A^T|APA{TTClAAAl- 

01ig«543 3* e egg gaa aat tta tat aaq ttt- 

I Anal I I -35 1. 

I TAT I CTA I T^ r I GOT | CAT | GAG | ACA | ATA ) ACC I CT I - 

ata cat a aa caa ata etc tOt tat tqq W 

1024 
I 

|TAP|ATA|ATA |r^G|GCAlCTAlTCClTl CGT 3' -011g#544 

ata tat ta f , crte egt gat agg a gca gat g 5' 
J gff^ond Target L I Xfrfll I 
Table 507: Variegation for selection with

Second Target



R|lc 

i m I k I d I i I w e|g 

j 1 i 36i 97i 9B\ 99il00| 
c. f^|^^«|^f;fi|T&A|eCTlATGlAAAlGATlATCiTC-€lrgAl- 



***** 

M|r K|q 6|d M|i S|r K|q 
v|g t|p h|r f|l n|)c t|p 

1 101 1 102 1 103 1 104 1 105 1 106 1 
|T-kG|inmGlsTT|wTlcjArk|mmAl- 



M|r P|q H|y F|y 

v|g r|l s|p vid n | 1 | r | Center of synaietry 
1 107 1 108 1 109 1 110 1 111 1 112 1 113 1 ^ for priming 

(ricG|cnG|v»T|v^iT|AAC |rTg|ACg|catlattlaatlacqlcctla-3' 

I primer |. 

k = equimolar T and G          r = equimolar A and G

w = equimolar T and A          s = equimolar C and G

m = equimolar A and C          y = equimolar T and C

Approximately 4 x 10^ DHA and protein sequences. 

* indicates sites of one alternative variegation. 
Table 508: Protein Ped-6-2
Selected for Binding to Second Target

|«|)c|d|i|wiE| 
I 1 I 96 1 97 1 98 1 99|l0p| 
I ATG I AAG 1 6 AT I Arr I TGG I G AG I 

iRlQlGlMlPjTl 
1 101 1 102 1 103 1 104 1 105 1 106 1 
I AGG I CAG I GGT I ATG t AGG I ACA I 

|M|P|Y|Fln|l|r|w|p|r| 

1 107 1 108 1 109 1 110 I 111 1 112 1 113 1 114 1 115 1 116 | 
I ATG I CCT I TAG | TTT | AAC | CT C | AGG | TGG j CCC 1 CGG | G 

I BSU36I i I Xfflfl II 

|e|v|lld|l|v|r|)c|v|a| 

1 117 1 118 1 119 1 120 1 121 1 122 1 123 { 124 1 125 1 126 | 

I ag|gtc|ctt1gat|ctt|gtt|cgc|aag|gtt|gct1 

I PPUM II 

|e|«|n|g|r|s|v|n|fi|e| 

1 127 1 128 1 129 1 130 1 131 1 132 1 133 1 134 1 135 1 136 1 

|gag|gaa|aac|ggt|cgg|tcc|gtt|aac|tct|g | 

I II I 

I Hpa I \ 

|i|y|n|rlv|ale|s|f|Xl 
1 137 1 138 1 139 1 140 1 141 ] 142 1 143 1 144 1 145 1 146 1 

ag|atc|tat|aat|cgc1gtt|atg|gag1tcg|ttc|aag| 
iPql HI 

|k|e|g|rli|g|a|.l-l.| 

1 147 1 148 1149 1 150 1 151 1 152 1 153 I | | | 

|aaa|gag|ggt|cgt|atc|ggc|cca|taa|tag|tga| 

tGGTlACCl 

I Kpn I I 
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Table 509: Protein Ped-6-2-5 
Selected for Binding to Third Target 

|BiR|D|V|W|H| 
I 1 I 96| 97| 98| 99|100| 
I ATG I AGG I GAT 1 6TT I TGG I CAT I 

j V I R i K i I i T i R i 
1 101 1 102 1 103 1 104 1 105 1 106 | 
I GT6 1 CCG I AAT I ATT I ACC I A6A I 

|v|R|H|L|n|l|r|w|plr| 

1 107 1 108 1 109 1 110 1 111 1 112 1 113 1 114 1 115 1 116 | 
I GTG I COT I CAC I TTG I AAC I CTC I AGO I TGG I CCC I CGG I G 

I Rfiii36i I I Xna II 

|e|vll|d|llv|r|]c|v|a| 
|117|118|119|120|121|122|123|124|125|126| 
I AG|GTC|CTT|GAT|CTT|GTT|CGC|AAG|GTT|GCT| 

I P°uM II 

|e|e|n|g|r|s|v|n|s|el 

1 127 1 128 1 129 1 130 1 ISlj 132 1 133 1 134 1 135| 136 1 
lGAG|GAAlAAC|GGT|CGG|TCC|GTT|AAC|TCr|G | 

■I Rgg II I 

I Hoa I I 

|i|y|n|r|v|iiile|s|C|)t| 
1137|138|139|140|X41|142|143|144|145|146| 

AG 1 ATC I TAT I AAT I CGC I GTT I ATG I GAG I TCG I TTC I AAG I 
|Byl III 

|lc|e|g|r|i|gla|.|.|.l 

1 147 1 148 1 149 1 150 1 151 1 152 1 153 I | | | 
I AAA I GAG I GGT I CGT I ATC I GGC I GCA I TAA I TAG I TGA I 

|6GT|ACC| 
Table 510: Variegation of Ped-6-2
for binding to Third Target

K|r R|h 
in t|i D|n I|v W|l p|l 

j 1 I 96 1 97 1 98 1 99 1 100 1 
j cc t ! c ag ! cGG j T AA \ CCT | ATG | AnA j r AT j rTT j TmG | CnT ] 
I spacer | | BstE II | 

6|6 

V|d Q|r n|d M|i R|t T|r 
1 101 1 102 1 103 1 104 1 105 1 106 | 
I GwT I CrG I r rT | ATr | AsG | AsA | 

M|)c F|c 

e|v P|r Y|h l|v n ] 1 1 r | Center of symmetry 
1 107 1 108 1 109 1 110 1 111 1 112 1 113 1 | for priming 

|rwG|CsT|yAC|Tkk|AAC|CTC|AGG|cgtlatt|aat|acg|cct|g 

I BSU36I I olig«506 \ 



Self-priming for extension with Klenow

5'-. .CTC|AGG|cgt|att| >^ 
3 * - g I tec I gca | taa | / 
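
The "center of symmetry for priming" marked in the oligonucleotide above implies that the 3'-terminal segment is its own reverse complement, so two copies of the oligonucleotide can anneal to each other and be extended with Klenow fragment. The following is a minimal sketch that checks this property for the 12-base segment cgt att aat acg as read from the table. The helper names are illustrative and do not come from the patent.

```python
# Illustrative helpers; only the 12-base sequence is taken from Table 510.
COMPLEMENT = str.maketrans("acgtACGT", "tgcaTGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def is_self_complementary(seq: str) -> bool:
    """True if the sequence equals its own reverse complement."""
    return seq.lower() == reverse_complement(seq).lower()


# Segment around the center of symmetry, as read from the table.
priming_segment = "cgtattaatacg"

print(reverse_complement(priming_segment))
print(is_self_complementary(priming_segment))
```

A segment that passes this check can base-pair with a second copy of itself in antiparallel orientation, which is the condition the self-priming diagram depicts.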
Table 511: Variegation for
Selection with Fourth Target



                        | m | X | X | X | X | X | X |
                        |  1| 90| 91| 92| 93| 94| 95|
|cct|cag|CGG|TAA|CCT|ATG|fzk|fzk|fzk|fzk|fzk|fzk|



| V | R | H | L | n | l | r |
|107|108|109|110|111|112|113|
|GTG|CGT|CAC|CTT|AAC|CTC|AGG|cgt|cac|ggc|



f = (0.26 T, 0.18 C, 0.26 A, 0.30 G)
z = (0.22 T, 0.16 C, 0.40 A, 0.22 G)

k = equimolar T and G

There are (2^5)^6 = 2^30 ≈ 10^9 DNA sequences.
There are 20^6 = 6.4 x 10^7 protein sequences.
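
The two counts follow from six variegated codons (residues 90-95), each of the form fzk: every codon admits 4 x 4 x 2 = 2^5 base combinations. The sketch below, which assumes the standard genetic code and the mixture proportions listed in this table, reproduces the counts and tabulates the amino-acid frequencies produced by a single fzk codon. Names such as `codon_distribution` are illustrative and not part of the source.

```python
from itertools import product

# Standard genetic code, ordered T, C, A, G at each position ("*" = stop).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Base mixtures as given in Table 511.
F = {"T": 0.26, "C": 0.18, "A": 0.26, "G": 0.30}
Z = {"T": 0.22, "C": 0.16, "A": 0.40, "G": 0.22}
K = {"T": 0.5, "G": 0.5}


def codon_distribution(first, second, third):
    """Probability of each amino acid (or stop) for one variegated codon."""
    dist = {}
    for (b1, p1), (b2, p2), (b3, p3) in product(
        first.items(), second.items(), third.items()
    ):
        aa = CODON_TABLE[b1 + b2 + b3]
        dist[aa] = dist.get(aa, 0.0) + p1 * p2 * p3
    return dist


n_codons = 6  # residues 90 through 95
dna_sequences = (len(F) * len(Z) * len(K)) ** n_codons
protein_sequences = 20 ** n_codons

dist = codon_distribution(F, Z, K)
print(dna_sequences, protein_sequences)
for aa, p in sorted(dist.items(), key=lambda item: -item[1]):
    print(f"{aa}: {p:.4f}")
```

Restricting the third base to T or G leaves TAG as the only reachable stop codon, while all twenty amino acids remain encodable.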




| R | D | V | W | H |
| 96| 97| 98| 99|100|
|CGG|GAC|GTG|TGG|CAC|
| overlap |

| V | R | N | I | T | R |
|101|102|103|104|105|106|
|GTG|CGG|AAT|ATT|ACG|CGA|

| Bsu36I | spacer
Table 512: Protein Ped-6-2-5-2
Selected for Binding to Fourth Target

| m | R | T | G | P | C | Q |
|  1| 90| 91| 92| 93| 94| 95|
|ATG|CGT|ACG|   |   |TGT|CAG|

| R | D | V | W | H |
| 96| 97| 98| 99|100|
|CGG|GAT|GTT|TGG|CAC|

| V | R | N | I | T | R |
|101|102|103|104|105|106|
|GTG|CGG|AAT|ATT|ACG|CGA|

| v | R | H | L | n | l | r | w | p | r |
|107|108|109|110|111|112|113|114|115|116|
|GTG|CGT|CAC|CTT|AAC|CTC|AGG|TGG|CCC|CGG|
                                |  Xma I  |

| e | v | l | d | l | v | r | k | v | a |
|117|118|119|120|121|122|123|124|125|126|
 AG|GTC|CTT|GAT|CTT|GTT|CGC|AAG|GTT|GCT|
| PpuM I |

| e | e | n | g | r | s | v | n | s | e |
|127|128|129|130|131|132|133|134|135|136|
|GAG|GAA|AAC|GGT|CGG|TCC|GTT|AAC|TCT|G  |
                | Rsr II |

| i | y | n | r | v | m | e | s | f | k |
|137|138|139|140|141|142|143|144|145|146|
 AG|ATC|TAT|AAT|CGC|GTT|ATG|GAG|TCG|TTC|AAG|
| Bgl II |
Table 512, continued

| k | e | g | r | i | g | a | . | . | . |
|147|148|149|150|151|152|153|   |   |   |
|AAA|GAG|GGT|CGT|ATC|GGC|GCA|TAA|TAG|TGA|

|GGT|ACC|
| Kpn I |

Table 513: Variegation
of Length of Ped-6-2-5-2

N|k V|g M|)c 

i y|. yl. r|. 11- 1|. 

1 137 1 138 1 139 I 140 1 141 | 142 | 
5 ' = cgacctagcAG j ATC \Tkm^\ vjKv^ i y iGA j kikjA j W3W4G j 

F|c 

e|. s|. 1|. )c|. 
j 143 1 144 1 145 I 146 1 
I Jt3 AG I TmiG | TlC2in2 1 W2AG | - 



I|JC 

k|. e|. g|. r|. 1\ . g|.| . I • I 
I 147 1 148 i 149 1 150| 151 | 152] 153 [ | 
I W2 AA I k3 AG I k3GA | YiGA | W3W4 A | k3GA | TAG | TGA | - 



|GGT|ACC|t- 3' 
I Kpn I I 







W1 = 0.65 T and 0.35 A
Y1 = 0.65 C and 0.35 T
k1 = 0.42 G and 0.58 T
k2 = 0.42 T and 0.58 G
k3 = 0.65 G and 0.35 T
m1 = 0.65 C and 0.35 A
m2 = 0.42 C and 0.58 A
W2 = 0.65 A and 0.35 T
W3 = 0.42 A and 0.58 T
W4 = 0.42 T and 0.58 A



Each variegated residue produces about 35% stop codons. 

Because (0.65)^^ » 0.003, only 0.3 % of variegated genes 
encode a protein shortened by one residue. 



Table for Example 6 
Table 600: Third finger domain of Igc -tgs- P22 arc 



|m|e|)clp|y|hl 
1 1 I 2 I 3 1 4 1 5 I 6 1 
!AGG!AGGjTAA!CCT|ATG|GAGiAAAieCGjTATiCACl 

iBfftS III 



|£|s|hl£|d|r|q|£|vlq| 
1 7 1 8 I 9 I 10) 111 12| 13| 14| 15| 16\ 
I TGC I TCA I CAC | TGT | GAT | CGT | CAG ] TTT | GTC | CAA 1 - 
I Dra III 



* i * *. * * * 

|v|a|n|l|r|r|H|l|r|v|H| 
17j 18| 19| 20| 21| 22| 23 | 24 | 25| 2S\ 27| 
GTG I GCC I AAC| TTA I AGA| CGT 1 CAT I CTA| CGC 1 GTG 1 CAC I - 

I B^i i| \hti II I Aft t III mm I I 

I APflL II 



<- linker >\< P22 arc 

********* 

|t|g|t|g|s|m|k|9|a| 

40 I 28l 29| 30| 31| 32 1 101 1 102 1 103 1 104 | 

|ACT|GGT|ACC|GGG|TCT(ATG|AAA|GGC|ATG| 

I Kpn I I 



****** 

l8|)c|B|p|q|f|n|l|r|w| 

1 105 1 106 1 107 1 108 1 109 1 110 1 111 1 112 1 113 1 114 | 
I TCT I AAG I ATG I CCG I CAA I TTC I AAC I CTT I AGG I TGG I 



Table 600, continued



|p|r|elv|l|d|l|vlr|kl 

1 115 1 116 1 117 1 118 1 119 1 120 1 121 1 122 1 123 1 124 | 
I CCC I CGG I GAG I GTC I CTT I GAT I TTG I GTT I CGC I AAA I 

I Xma 1 1 



|vja|eje|n|g|r|8|v|n|s| 

1 125 1 126 1 127 1 128 1 129 1 130 1 13l| 132 1 133 1 134 1 135 1 
I GTC I GCT I GAA I GAG I AAT I GGC I CGG I TCC I GTG I AAT I TCT I 

I Ken 6321 | Rsr " | lEgOR 1 1 



|e|ily|n|r|v|n|e|8| 

1 136 1 137 1 138 1 139 1 140 1 141 1 142 1 143 1 144 I 
I GAG I ATC I TAT I AAT I CGT I GTT I ATG I GAA I AGC I 

| Bal II) 



l£|k|k|e|g|r|i|g|a| . | 

1 145 1 146 1 147 1 148 ) 149 1 150 1 151 1 152 1 153 1 154 | 
I TTC I AAG I AAG I GAA I GGT I CGC I ATT I GGT I GCA I TAA I 



1 155 1 156 1 
|TAG|TGA|GGA|TTC| 

iHindllll 



* indicates residues of zinc finger domain thought to
contact DNA in model of Gibson et al.

* indicates residues of zinc finger domain, linker, and
Arc that may influence DNA binding.
Claims

1. A selection vector for selecting recipient cells transformed by such vector that express a protein or polypeptide that
binds specifically to a predetermined target DNA sequence borne by said vector, which comprises a first and a second
operon each comprising at least one expressible gene, the genes of said first and second operon being different,
a copy of the target DNA sequence being included in each operon and positioned therein so that the recipient
cells enjoy a selective advantage, other than resistance to lytic growth of phage, if they express a protein or
polypeptide which binds to said copies of the target DNA sequence.

2. The vector of claim 1 wherein at least one operon comprises a selectable beneficial gene, an occludible promoter
operably linked to said beneficial gene and directing its transcription, an occluding promoter occluding transcription
from said occludible promoter of said beneficial gene, and a copy of the target DNA sequence positioned so that
the binding of said protein or polypeptide to said copy represses said occluding promoter and thereby facilitates
transcription of said beneficial gene.

3. The selection vector of either of claims 1 or 2 which comprises:

a) a first operon, which operon comprises:

i) a first binding marker gene or genes, 

ii) a first promoter controlling expression of said binding marker gene or genes, and 

iii) a first copy of the target DNA sequence, where said target DNA sequence interferes substantially with 
expression of the first gene(s) if a protein expressed by the recipient cell binds to the target DNA 
sequence, 

b) a second operon, which operon comprises: 

i) a second binding marker gene or genes,

ii) a second promoter controlling expression of said second binding marker gene or genes, and 
iii) a second copy of the target DNA sequence,

where said target DNA sequence interferes substantially with expression of the second gene(s) if a protein 
expressed by the recipient cell binds to the target DNA sequence, 

where the binding marker genes of said first and second operons are different, and where, when said 
transformed cells are exposed to forward selection conditions the gene products of said first and second
binding marker genes are deleterious or lethal to the recipient cell. 

4. The vector of claim 3 in which at least one of the operons confers a genotype selected from the group consisting of
galT,K+, tetA+, lacZ+, pheS+, argP+, thyA+, crp+, pyrF+, ptsM+, secA+/malE+/lacZ+, ompA+, btuB+, lamB+, tonA+,
cir+, tsx+, aroP+, cysK+, and dctA+.

5. The vector of claim 3 wherein the binding marker genes are functionally unrelated. 

6. The vector of claim 5 wherein the first and second operons confer, respectively, a pair of genotypes selected from 
the group consisting of: 



(a) galT,K+ and tetA+;

(b) argP+ and pheS+;

(c) lacZ+ and tetA+;

(d) dctA+ and cysK+;

(e) crp+ and thyA+;

(f) lamB+ and thyA+;

(g) secA+/malE+/lacZ+ and pyrE+;

(h) tsx+ and cysK+;

(i) dctA+ and thyA+;

(j) galT,K+ and pheS+;

(k) tetA+ and thyA+;

(l) ptsM+ and thyA+;

(m) ompA+ and pyrF+;

(n) btuB+ and pyrF+;

(o) tonA+ and galT,K+;

(p) cir+ and cysK+; and

(q) aroP+ and lacZ+.

7. The vector of any of claims 3-6 wherein the promoters of said first and second operons are different.

8. The vector of claim 7 wherein the degree of homology between the first and second promoters is less than 50% in
the region between the -10 region of the promoter and the base at which transcription is initiated. 

9. The vector of any of claims 3-8 wherein at least one of said operons comprises a plurality of copies of the target 
DNA sequences, wherein each copy is positioned so that the target DNA sequence interferes substantially with
expression if and only if a protein expressed by the recipient cell binds to the target DNA sequence. 

10. The vector of any of claims 3-9 further comprising a plurality of genetic elements essential to the maintenance of
the vector or the survival of the transformed cells under conditions that select for presence of said vector, said operons
and said genetic elements being positioned on said vector so no single deletion event can render nonfunctional
more than one of said operons without also rendering nonfunctional one of said essential genetic elements.

11. The vector of claim 10 wherein at least one of said genetic elements comprises a selectably beneficial or essential
gene, and a control promoter operably linked to said beneficial or conditionally essential gene, but where no
instance of said target DNA sequence is associated with said genetic element.

12. The vector of claim 11 wherein the control promoter is essentially identical to the promoter of one of said selectable
binding marker operons, so that proteins binding to the latter promoter will also bind to the control promoter and
thereby inhibit expression of said beneficial or essential gene.

13. The vector of any of claims 3-12 wherein under reverse selection conditions the gene products of said binding 
marker genes are beneficial or conditionally essential to the transformed cells. 






EP 0 452 413 B1 

14. The vector of claim 13 wherein each of the first and second operons confers a phenotype selected independently
but non-identically from the group consisting of: galT,K+, tetA+, lacZ+, pheS+, argP+, thyA+, crp+, pyrF+, ptsM+,
secA+/malE+/lacZ+, ompA+, btuB+, lamB+, tonA+, cir+, tsx+, aroP+, cysK+, and dctA+.

15. The vector of any of claims 3-14 further comprising a nondeleterious cloning site so positioned that insertion of a 
foreign gene at such site does not inactivate said first or second operons or any genetic element of said vector 
required for its maintenance within the transformed cell. 

16. The selection vector of any of claims 3-15 in which a target sequence associated with at least one of said operons
is positioned within the RNA-polymerase binding site of the promoter of the operon.

17. The selection vector of any of claims 3-16 in which a target sequence associated with at least one of said operons
is positioned upstream of the -35 region of the promoter of the operon. 

18. The selection vector of any of claims 3-17 in which a target sequence associated with at least one of said operons
is positioned downstream of the -10 region of the promoter of the operon.

19. The selection vector of any of claims 3-18 in which a target sequence is positioned so that the most 5' base of the
target sequence is transcribed into the +1 base or the +2 base of the mRNA transcribed under the direction of the 
promoter of the operon. 

20. The selection vector of claim 10 in which one of said genetic elements is the origin of replication of said vector. 

21. The vector of any of claims 3-20 further comprising a gene (pdbp) coding on expression for a potential DNA-binding
protein or polypeptide, said gene comprising: 

a) a coding region that codes on expression for a polypeptide, each domain of said polypeptide having at least 
50% sequence identity to a known DNA-binding domain, and 

b) a promoter operably linked to said coding region for controlling its expression.

22. The vector of any of claims 3-20 wherein said first or second promoter is an inducible or repressible promoter.

23. The vector of any of claims 3-22 wherein the target DNA sequence comprises 10-25 base pairs. 

24. The vector of any of claims 3-22 wherein no copy of the target DNA sequence occurs naturally in said first promoter, 
said second promoter, noncoding regions of said first binding marker gene or noncoding regions of said second 
binding marker gene. 

25. The vector of claim 1 wherein the selective advantage is the ability to better utilize a particular nutrient or reduced 
dependency on a particular nutrient. 

26. The vector of claim 1 wherein the selective advantage is resistance to a substance otherwise toxic to the recipient
cell. 

27. The vector of claim 26 wherein the selective advantage is resistance to an antibiotic. 

28. A population of vectors according to claim 21 randomly mutated to potentially encode any of 2 to 20 predetermined
amino acids, at predetermined codons within the pdbp gene so that said vectors collectively can express a plurality
of different but sequence-related potential DNA-binding proteins.

29. The population of claim 28 wherein the level of random mutation is such that from 10* to 10® different potential 
DNA-binding proteins can be expressed. 

30. A cell culture comprising a plurality of cells, each cell bearing a selection vector according to any of claims 1-27,
said cell bearing a gene coding on expression for a potential DNA-binding protein, where said cells collectively can 
express a plurality of different but sequence-related potential DNA-binding proteins. 

31. A method of obtaining a gene coding on expression for a novel DNA-binding protein or polypeptide that specifically
binds a predetermined DNA target sequence in double stranded DNA, comprising:

(a) providing a cell culture according to claim 30;

(b) causing the cells of such culture to express said potential DNA-binding proteins or polypeptides; 

(c) exposing the cells to forward selection conditions to select for cells which express a protein or polypeptide 
which specifically binds to a said target DNA sequence; and 

(d) recovering the selected cells bearing a gene coding on expression for such protein or polypeptide. 

32. A method of producing a DNA-binding protein or polypeptide which specifically binds a predetermined double
stranded DNA target, which comprises obtaining a gene by the method of claim 31 which codes on expression for
such protein or polypeptide, expressing the gene in a suitable host cell, and recovering said protein or polypeptide.

33. A method of obtaining a protein or polypeptide which may be used to specifically repress a coding or regulatory 
element of interest which comprises obtaining a gene by the method of claim 31, determining the sequence of at
least the DNA-binding domain of the protein or polypeptide, and producing at least the DNA-binding domain of the 
protein or polypeptide. 

34. The method of claim 31 wherein a gene encoding a known DNA binding protein picked from the group consisting
of Cro from phage λ, cI repressor from phage λ, Cro from phage 434, cI repressor from phage 434, P22 repressor,
E. coli tryptophan repressor, E. coli CAP, P22 Arc, P22 Mnt, E. coli lactose repressor, MAT-a1-alpha2 from yeast,
Polyoma Large T antigen, SV40 Large T antigen, Adenovirus E1A, and TFIIIA from Xenopus laevis is randomly
mutated to potentially encode any of 2 to 20 predetermined amino acids, at predetermined codons, to obtain genes
coding on expression for a plurality of potential target DNA-binding proteins.

35. The method of claim 31 in which the DNA binding protein comprises a plurality of zinc finger DNA-binding domains. 

36. The method of claim 31 in which the cells prior to transformation are of a GalE-, GalT-, GalK-, Tet^s phenotype, the
binding marker genes are the tet and galTK genes, and the forward selection condition is cultivation of the cells in
a medium containing galactose, or fusaric acid, or both, or substances metabolized into or catalyzing the production
of galactose or fusaric acid.

37. The method of claim 31 wherein said pdbp gene is randomly mutated to potentially encode any of 2 to 20 predetermined
amino acids, at predetermined codons, such that at least one randomly mutated codon of said gene
genetically encodes all twenty amino acids and yields at that codon a ratio of abundance of least favored amino
acid to most favored amino acid which is greater than that obtained with an NNN codon, N denoting an equimolar
mixture of G, A, T and C.

38. The method of claim 37 wherein for said codon, the frequencies with which acidic and basic amino acids are
encoded are equal.

39. The method of claim 37 wherein said randomly mutated codon has substantially the following base proportions:

             T       C       A       G
base #1     0.26    0.18    0.26    0.30
base #2     0.22    0.16    0.40    0.22
base #3     0.5     0.0     0.0     0.5.
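
The conditions of claims 37 and 38 can be checked numerically against the proportions of claim 39. The following is a minimal sketch, assuming the standard genetic code and taking "basic" to mean lysine plus arginine and "acidic" to mean aspartate plus glutamate (an assumption, since the claims do not list the residues). Function names are illustrative.

```python
from itertools import product

# Standard genetic code, ordered T, C, A, G at each position ("*" = stop).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codon_distribution(first, second, third):
    """Probability of each amino acid (or stop) for one mixed-base codon."""
    dist = {}
    for (b1, p1), (b2, p2), (b3, p3) in product(
        first.items(), second.items(), third.items()
    ):
        aa = CODON_TABLE[b1 + b2 + b3]
        dist[aa] = dist.get(aa, 0.0) + p1 * p2 * p3
    return dist


def least_to_most_ratio(dist):
    """Ratio of least to most favored amino acid, ignoring stop codons."""
    probs = [p for aa, p in dist.items() if aa != "*"]
    return min(probs) / max(probs)


# Proportions from claim 39, and an equimolar NNN codon for comparison.
claim39 = codon_distribution(
    {"T": 0.26, "C": 0.18, "A": 0.26, "G": 0.30},
    {"T": 0.22, "C": 0.16, "A": 0.40, "G": 0.22},
    {"T": 0.5, "C": 0.0, "A": 0.0, "G": 0.5},
)
equimolar = {b: 0.25 for b in BASES}
nnn = codon_distribution(equimolar, equimolar, equimolar)

ratio_claim39 = least_to_most_ratio(claim39)
ratio_nnn = least_to_most_ratio(nnn)
acidic = claim39["D"] + claim39["E"]
basic = claim39["K"] + claim39["R"]

print(f"least/most, claim 39 codon: {ratio_claim39:.3f}")
print(f"least/most, NNN codon:      {ratio_nnn:.3f}")
print(f"acidic: {acidic:.4f}  basic: {basic:.4f}")
```

Under these assumptions the claim 39 codon gives a flatter amino-acid distribution than NNN, and the acidic and basic totals agree to within rounding of the stated proportions.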



Patentansprüche

1 . Selektionsveklor zur Selektion von Empfangerzelten. die mit einem solchen Vektor transformiert sind. wetehe ein 
Protein Oder Potypeptid exprimieren. das spezif isch an eine vorbestimmte Ziel-DNA-Sequenz bindet. die von dem 
Vektor getragen wird. umfassend ein erstes und zweites Operon, wobei jedes mindestens ein exprimierbares Gen 
umfaBt, wobei die Gene des ersten und zweiten Operons unterschiedlich sind. wobei eine Kopie der Ziet-DNA- 
Sequenz in jedem Operon enthalten ist und darin so positioniert ist. daB die Empfdngerzellen einen Selektionsvor- 
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teil haben. der nicht die Resistenz gegenuber lytischem Wachstum von Phagen ist. wenn Sie ein Protein oder Poly- 
peptid exprimieren. das an die Kopien der Ziel-DNA-Sequenz bindet. 

2. Vektor nach Anspruch 1, wobei mindestens ein Operon ein selektierbares nutzliches Gen. einen okWudierbaren 
Promotor. der funkttonell mit dem nutzlichen Gen verknupft ist und seine Transkription lenkt. einen okWudierenden 
Promoton der die Transkription von dem okWixjierbaren Promotor des nutzlichen Gens okkludiert. und eine Kopie 
der Ziel-DNA-Sequenz umfa3l. die so positioniert ist, daR die Bindung des Proteins Oder Polypeptids an die Kopie 
den okWudierenden Promotor reprimiert und so die Transkription des nutzlichen Gens erleichtert. 

3. Selektionsvektor nach Anspruch 1 oder 2, umfassend: 

a) ein erstes Operon. wobei das Operon umfaRt: 

i) ein erstes Bindungsmarkergen oder-gene, 

ii) einen ersten Promotor. der die Expression des (der) Bindungsmarkergens oder -gene steuert. und 

iii) eine erste Kopie der Ziel-DNA-Sequenz. wobei die Ziel-DNA-Sequenz die Expression des (der) ersten 
Gens (Gene) erheblich beeinflusst, wenn ein Protein, das von der EmpfSngerzelle exprimiert wird. an die 
Ziel-DNA-Sequenz bindet. 

b) ein zweites Operon, wobei das Operon umfaRt: 

i) ein zweites Bindungsmarkergen oder-gene, 

ii) einen zweiten Promotor. der die Expression des (der) zweiten Bindungsmarkergens oder-gene steuert. 

und ^ 

iii) eine zweite Kopie der Ziel-DNA-Sequenz. wobei die Ziel-DNA-Sequenz die Expression des (der) zwet- 
ten Gens (Gene) erheblich beeinflusst. wenn ein Protein, das von der Empfangerzelle exprimiert wird. an 
die Ziel-DNA-Sequenz bindet, 

wobei die Bindungsmarkergene des ersten und zweiten Operons unterschiedlich sind und wobei. wenn 
die transformierten Zellen Vonwarts-Selektionsbedingungen ausgesetzt werden. die Genprodukte der 
ersten und zweiten Bindungsmarkergene schadlich oder t<5dlich fur die Empfangerzelle sind. 

4 Vektor nach Anspruch 3. in dem mindestens eines der Operons einen Genotyp verleiht. ausgewAhlt aus der 
Gruppe ga!IK\ teL^\ lacT , BtieS\ aisP\ 

. iami* , tooA* . cir* , isx* . acoE* . cysK* . ""^ 

5. Vektor nach Anspruch 3, wobei die Bindungsmarkergene in funktioneller Hinsicht nicht venwandt sind. 

6. Vektor nach Anspruch 5. wobei das erste bzw. zweite Operon ein Genotyp-Paar verleihen. ausgewShtt aus der 
Gruppe 

(a) gaTLK* und tetA* : 

(b) araP* und BheS*^ : 

(c) \3Qr und IfiiA* ; 

(d) 5j£tA* und c^"" ; 

(e) cre^ und thyA" ^ ; 

(f) lamff" und thyA* ; 

(g) SfiCA'' /malE^ /lac2* u nd pytE* : 

(h) tsx*und£^*; 

(i) dctA* und thvA* ; 
0) oalTK^ unrf QdfiT ; 
(k)tetA*undttaA* ; 
(t) ptsM ^ und tby&* ; 
(m) ompA * und pvrF* ; 
(n) btuB* und eyrT" ; 
(0) tonA* und ga!IJ<* : 
(p) or* und cysK "^: und 
(q) aroP* und lacZ '^ . 
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7. Vektor.nach einem der Ansprtich 3 bis 6. wobei die Promotoren der ersten und zweiten Operons unterschiedlich 
sind- 

8. Vektor nach Anspruch 7, wobei der Grad der Homologie zwischen den ersten und zweiten Promotoren im Bereich 
zwischen dem -10-Bereich des Promoters und der Base, bei der die Transkription beginni, geringer als 50 % ist. 

9. Vektor nach einem der Anspruche 3 bis 8, wobei mindestens eines der Operons eine Vielzahl von Kopien der Ziel- 
DNA-Sequenzen umfaRt. wobei jede Kopte so positioniert ist. daB die Ziei-DNA-Sequenz die Expression erheblich 
beeinf lusst. wenn und nur wenn ein Protein, das von der Empfangerzelle exprimiert wird. an die Ziel-DNA-Sequenz 
bindet. 

1 0. Vektor nach einem der Anspruche 3 bis 9. ferner umfsssend eine Vieizahl von genetischen Elementen, die wesent- 
iich sind fiir die Erhaltung des Vektors Oder das Oberleben der transformierten Zellen unter Bedingungen, die auf 
das Vorhandensein des Vektors selektieren. wobei die Operons und die genetischen Elemente so auf dem Vektor 
positioniert sind, daS kein EinzeWeietionsereignis eine Nichtfunktionalit^t von mehr als einem der Operons verur- 
sachen kann. ohne auch eines der wesentltchen genetischen Elemente nichtfunktionel! zu machen. 

11. Vektor nach Anspruch 10, wobei mindestens eines der genetischen Elemente ein selektierbar nutzliches oder 
essentielles Gen umfaBt, und einen Kontrollpromotor, der funktionell mit dem nutzlichen oder gegebenenfalls 
essentiellen Gen verknupft ist, wobei aber keinesfalls die Ziel-DNA-Sequenz mit dem genetischen Element asso- 
zitert ist. 

12. Vektor nach Anspruch 1 1 . wobei der Kontrollpromotor im Wesentlichen identisch mit dem Promoter von einem der 
selektierbaren Bindungsmarkeroperons ist. so daB Proteine. die an den letzteren Promotor binden. auch an den 
Kontrollpromotor binden und dabei die Expression des nutzlichen oder essentiellen Gens hemmen. 

1 3. Vektor nach einem der Anspruche 3 bis 1 2, wobei unter reversen Setektionsbedingungen die Genprodukte der Bin- 
dungsmarkergene nutzlich oder bedingt essentiell fur die transformierten Zellen sind. 

14. Vektor nach Anspruch 13. wobei jedes der ersten und zweiten Operons einen Phenotypen verleiht, unabhangig 
aber nicht identisch ausgewShlt aus der Gruppe bestehend aus: ga|LK* . tetA* . [acZ* , pheS"" . argP* , thyA* . crfi* 
, pyrF ^ . BfeM* . secA+ /mali* /jacZ* , omDaA "^ . btuB* . [amB* , tonA* . or* . tex* . aroP*. cysK"- . unddcJA*. 

15. Vektor nach einem der AnsprQche 3 bis 14. ferner umfassend eine nicht-schadliche Clonierungsstelle, die so posi- 
tioniert ist. daB die Insertion eines Fremdgens an der Stelle nicht die ersten oder zweiten Operons oder irgendein 
genetisches Element des Vektors, das fur die Erhaltung innerhalb der transformierten Zelle erfordertich ist, inakti- 
viert. 

1 6. Selektionsvektor nach einem der Anspruche 3 bis 1 5, wobei eine Zielsequenz, die mit mindestens einem der Ope- 
rons assoziiert ist. innerhalb der RNA-Polymerase-Bindungsstetle des Promoters des Operons positioniert ist. 

1 7. Selektionsvektor nach einem der Anspruche 3 bis 1 6, wobei eine Zielsequenz, die mit mindestens einem der Ope- 
rons assoziiert ist. stremaufwarts vom -35-Bereich des Promoters des Operons positioniert ist. 

18. Selektionsvektor nach einem der Anspruche 3 bis 1 7. wobei eine Zielsequenz. die mit mindestens einem der Ope- 
rons assoziiert ist. stromabwarts vom -10-Bereich des Promotors des Operons positioniert ist. 

1 9. Selektionsvektor nach einem der Anspruche 3 bis 1 8. wobei eine Zielsequenz so positioniert ist, daB die 5-nachst- 
gelegene Base der Zielsequenz in die +1 -Base oder +2-Base der mRNA transkribiert wird, die unter der Steuerung 
des Promotors des Operons transkribiert wird. 

20. Selektionsvektor nach Anspruch 10. wobei eines der genetischen Elemente der Replikalionsursprung des Vektors 
ist. 

21. Vector nach einem der Anspruche 3 bis 20, ferner umfassend ein Gen (pdboi . das bei Expression ein potentieiles 
DNA-Bindungsprotein oder -polypeplid codiert, wobei das Gen umfaBt: 

a) einen Codierungsbereich. der bei Expression ein Polypeptid codiert. wobei jede Domane des Polypeptids 
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mindeslens 50 % Sequenzidentitat zu einer bekannten DNA^Bindungsdomane aufweist. und 

b) einen Promotor. der funktionell verknupft ist mit dem Codierungsbereich zur Kontrolle seiner Expression. 

22. Vektor nach einem der Anspruche 3 bis 20. wobei der erste Oder zweite Promotor ein induzierbarer Oder reprimier- 
barer Promotor ist. 

23. Vektor nach einem der Anspruche 3 bis 22, wobei die Ziel-DNA-Sequenz 10 bis 25 Basenpaare umfaBt. 

24 Veklor nach einem der Anspruche 3 bis 22, wobei keine Kopie der Ziel-DNA-Sequenz naturlichen«eise in dem 
' ersten Promotor, dem zweiten Promotor, den nicht-codierenden Bereichsn des ersten Binaungsmarkergens Oder 
dsn nichicodierenden Bereichen des zweiten Bindungsmarkergens vorkommt. 

25. Vektor nach Anspruch 1. wobei der Selektlonsvorteil die Fahigkeit ist, ein bestimmtes Nahrungsmittel besser zu 
venfverten. oder eine geringero Abhangigkeit von einem bestimmten Nahrungsmittel. 

26. Vektor nach Anspruch 1 , wobei der Selektlonsvorteil die Resistenz gegenCiber einer Substanz ist, die sonst toxisch 
fur die Empfangerzelle ist. 

27. Vektor nach Anspruch 26, wobei der Selekbonsvorleil die Resistenz gegenuber einem Antibiotikum ist. 

28. Population von Vektoren nach Anspruch 21 , die zutailig mutiert sind, so daB sie maglicheoweise eine der 2 bis 20 
vorbestimmten Aminosauren codieren, an vortjestimmten Codons innerhalb des udte-Gens so da8 die Vektoren 
vereint eine Vielzahl von unterschiedlichen aber sequenzvenwandten moglichen DNA-Bindungsproteinen expri- 
mieren kOnnen. 

29. Population nach Anspruch 28, wobei das AusmaB der Zufallsmutation so ist, daB 10* bis 10« unterschiedliche 
potentielle DNA-Bindungsproteinen exprimiert werden kOnnen. 

30 Zellkultur umfassend eine Vielzahl von Zellen, wobei jede Zelle einen Selektionsvektor nach einem der Anspruche 
' 1 bis 27 tragt wobei die Zelle ein Gen trSgt. das bei Expression ein potentielles DNA-Bindungsprotein orfiert, 

wobei die Zellen vereint eine Vielzahl von unterechiedlichen aber sequenzverv/andten potentiellen DNA-Bindungs- 
proteinen exprimieren kOnnen. 

31 Vertahren zum Erhalt eines Gens, das bei Expression ein neues DNA-BindungsproteIn oder -po^ipeptid codiert. 
das spezifisch an eine vorbestimmte DNA-Zlelsequenz in doppelstrangiger DNA bindet, umfassend: 

a) Bereitslellung einer Zellkultur nach Anspruch 30. 

b) Veranlassung der Zellen einer sdchen Kultur, die potentiellen DNA-Bindungsproteine oder -polypeptide zu 

cMJnterwelfen der Zellen unter Vofwarts-Selektionsbedingungen, urn Zellen auszuwahlen, die ein Protein 
Oder Pdypeptid exprimieren, das spezifisch an die Ziel-DNA-Sequenz bindet, und „ . ^ ^ 

d) Gewinnung der ausgewahlten Zellen. die ein Gen tragen. das bei Expression ein solches Protein oder Pdy- 
peptid codiert. 

32 Venahren zur Produktioo eines DNA-Bindungsproteins oder -polypeptlds. das spezHisch an ein vorbestimmtes 
doooelstrangiges DNA-Ziel bindet. umfassend das Erhalten eines Gens durch das Verfehren nach Anspruch 31. 
welches bei Scpression ein solches Protein oder Polypeptid codiert. Expression des Gens in einer geeigneten 
Wirtszelle und Gewrinnung des Proteins oder Polypeptids. 

33. Vertahren zum Erhalten eines Proteins oder Polypeptids, das venr/endet werden tann um speztfisch ein o^ererv 
des Oder regulatorischas Element von Interesse zu reprimieren. umfassend das Erhalten e.n^ Gens gen^B dern 
Vertahren von Anspruch 31. Bestimmung der Sequenz von mindestens der DNA-Blndungsdomaoe des Proteins 
Oder Polypeptids und Produktion von mindestens einer DNA-Blndungsdonfiane des Proteins oder Polypeptids. 

34 Vertahren nach Anspruch 31 , wobei ein Gen. das ein bekanntes DNA-Bindungsprotein codiert. aus^dhlt aus der 
GruTe bestehend aus Cro vom Phagen X. d-Repressor vom Phagen X, Cro vom ^^^^^J^'f^^^^''^ 
Phagen 434. P22-Repressor, E. M-Tryplophan-Repressor, E. so!i-CAP, P22 Arc, P22 Mnt, e^-La^ose- 
Repressor, MAT-a1-alpha2 aus Hefe, Polyoma Large T-Antigen. SV40 Large T-Antigen. Adenovirus El A. und TFI- 
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IIA von XenoDus laevis zufailig-mutiert ist. um rrxjglichGnveise eine der 2 bis 20 vorbestimmten Aminosauren zu 
codieren. an vorbestimmten Codons. so daR Gene erhalten werden. die bei Expression eine Vielzahl von mGgl.- 
chen Ziel-DNA-Birxiungsproteinen codieren. 

35. Verfahren nach Anspruch 31, wobei das DNA-Bindungsprotein eine Vielzahl von Zink-Finger-DNA-Bindungsdoma- 
nen umfaBt. 

36 Verfahren nach Anspruch 31. wobei die Zellen vor der Transformation vom GalE'. GalT. GalK'. Tet^ -Phenotyp 
sind die Bindungsmarkergene die let- und oalTK -Gene sind und die VonA^arts-Selektionsbedingung die Zuchtung 
der Zellen in einem Medium 1st. das Galactose Oder Fusarinsaure Oder beidss enthait oder Suostanzen. die tn 
Galactose oder Fusarinsaure verstoffwechselt werden oder deren Produktion katalysieren, 

^7 vsrfahren nach Anspruch 31 . v«)bei das Qdbfi-Gen zufailig mutiert ist. so daR es eine der 2 bis 20 vorbestimmten 
Aminosauren codiert. an vorbestimmten Codons. so daft mindestens ein zufailig mutiertes Codon des Gens gene- 
tisch alle 20 Aminosauren codiert und an diesem Codon einen OberschufJanteil der am wenigsten bevorzugten 
Aminosaure bis zur am meisten bevorzugten Aminosaure aufv^eist. der grORer ist als der. der mH einem NNN- 
Codon erhalten wird. wobei N ein equimolares Gemisch von G, A. T und C bedeutet. 

38. Verfahren nach Anspruch 37. wobei fur das Codon die Haufigkeiten. mil denen saure und basische Aminosauren 
codiert werden. glelch sind. 

39. Verfahren nach Anspruch 37. wobei das zufailig mutierte Codon im Wesentlichen die folgenden Basenanteile auf- 
weist: 





               T       C       A       G
Base Nr. 1    0.26    0.18    0.26    0.30
Base Nr. 2    0.22    0.16    0.40    0.22
Base Nr. 3    0.5     0.0     0.0     0.5.



Revendications 

1. A selection vector for selecting recipient cells, transformed by such a vector, which express a protein or
polypeptide that binds specifically to a predetermined target DNA sequence borne by said vector, which vector
comprises a first and a second operon, each comprising at least one expressible gene, the genes of said first
and second operons being different, a copy of the target DNA sequence being included in each operon and
positioned therein such that the recipient cells enjoy a selective advantage, other than resistance to lytic
phage growth, if they can express a protein or polypeptide that binds to said copies of the target DNA sequence.

2. The vector of claim 1, wherein at least one operon comprises a selectable beneficial gene, an occludable
promoter operably linked to said beneficial gene and directing its transcription, an occluding promoter that
occludes transcription of said beneficial gene from said occludable promoter, and a copy of the target DNA
sequence positioned such that binding of said protein or polypeptide to said copy represses said occluding
promoter and thereby facilitates transcription of said beneficial gene.

3. The selection vector of claim 1 or 2, which comprises:

a) a first operon, which operon comprises:

i) one or more first binding marker genes;

ii) a first promoter regulating expression of said binding marker gene or genes; and

iii) a first copy of the target DNA sequence, where said target DNA sequence substantially interferes
with expression of the first gene or genes if a protein expressed by the recipient cell binds to the
target DNA sequence;

b) a second operon, which operon comprises:

i) one or more second binding marker genes;

ii) a second promoter regulating expression of said second binding marker gene or genes;
and

iii) a second copy of the target DNA sequence, where said target DNA sequence substantially interferes
with expression of the second gene or genes if a protein expressed by the recipient cell binds to the
target DNA sequence,

where the binding marker genes of said first and second operons are different, and where, when said
transformed cells are exposed to forward selection conditions, the gene products of said first and second
binding marker genes are deleterious or lethal to the recipient cell.

4. The vector of claim 3, wherein at least one of the operons confers a genotype selected from the group
consisting of galT,K+, tetA+, pheS+, argP+, thyA+, crp+, pyrF+, ptsM+, secA+/malE+/lacZ+, ompA+, btuB+, lamB+,
tonA+, cir+, tsx+, aroP+, cysK+ and dctA+.

5. The vector of claim 3, wherein the binding marker genes are not functionally related.

6. The vector of claim 5, wherein the first and second operons confer, respectively, a pair of genotypes
selected from the group consisting of:

(a) galT,K+ and tetA+;

(b) argP+ and pheS+;

(c) lacZ+ and tetA+;

(d) dctA+ and cysK+;

(e) crp+ and thyA+;

(f) lamB+ and thyA+;

(g) secA+/malE+/lacZ+ and e^^;

(h) tsx+ and cysK+;

(i) dctA+ and thyA+;
(j) galT,K+ and pheS+;
(k) tetA+ and thyA+;
(l) ptsM+ and thyA+;
(m) ompA+ and pyrF+;
(n) btuB+ and pyrF+;
(o) tonA+ and galT,K+;
(p) cir+ and cysK+; and
(q) aroP+ and lacZ+.

7. The vector of any one of claims 3 to 6, wherein the promoters of said first and second operons are
different.

8. The vector of claim 7, wherein the degree of homology between the first and second promoters is less than
90% in the region between the -10 region of the promoter and the base at which transcription is initiated.

9. The vector of any one of claims 3 to 8, wherein at least one of said operons comprises a plurality of
copies of the target DNA sequence, wherein each copy is positioned such that the target DNA sequence
substantially interferes with expression if and only if a protein expressed by the recipient cell binds to
the target DNA sequence.

10. The vector of any one of claims 3 to 9, further comprising a plurality of genetic elements essential to
the maintenance of the vector or to the survival of the transformed cells under conditions that select for the
presence of said vector, said operons and said genetic elements being positioned on said vector such that no
single deletion event can render more than one of said operons nonfunctional without also rendering
nonfunctional one of said essential genetic elements.
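The placement constraint of claim 10 can be illustrated with a small check on a circular map. This is a simplified sketch under stated assumptions, not the patent's procedure: each element is treated as an indivisible unit, a deletion is modeled as removing one contiguous arc of the circle, and the labels "operon", "essential" and "other" and the function name `deletion_safe` are hypothetical. Under this model, a single deletion can disable two operons without touching an essential element exactly when two cyclically adjacent operons have no essential element between them.

```python
def deletion_safe(elements):
    """Check the claim-10 style layout on a circular vector map.

    `elements` lists the map in circular order; each entry is one of the
    (hypothetical) labels "operon", "essential" or "other".
    Returns True if every arc joining two neighbouring operons contains
    at least one essential element.
    """
    operon_positions = [i for i, kind in enumerate(elements) if kind == "operon"]
    if len(operon_positions) < 2:
        return True  # nothing to protect against with fewer than two operons

    n = len(elements)
    for k, start in enumerate(operon_positions):
        # Next operon going around the circle (wraps to the first one).
        end = operon_positions[(k + 1) % len(operon_positions)]
        i = (start + 1) % n
        found_essential = False
        while i != end:
            if elements[i] == "essential":
                found_essential = True
                break
            i = (i + 1) % n
        if not found_essential:
            return False
    return True


if __name__ == "__main__":
    # Operons alternate with essential elements: layout is protected.
    print(deletion_safe(["operon", "essential", "operon", "essential"]))
    # Two operons sit side by side: one deletion could remove both.
    print(deletion_safe(["operon", "operon", "essential", "essential"]))
```

With two operons, the check amounts to requiring an essential element (for example the origin of replication or a selectable marker) on each of the two arcs that separate them.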






11. The vector of claim 10, wherein one of said genetic elements comprises a selectively beneficial or
essential gene and a control promoter operably linked to said beneficial or conditionally essential gene, but
where no instance of said target DNA sequence is associated with said genetic element.

12. The vector of claim 11, wherein the control promoter is essentially identical to the promoter of one of
said selectable binding marker operons, so that proteins binding to the latter promoter will also bind to the
control promoter and thereby inhibit expression of said beneficial or essential gene.

13. The vector of any one of claims 3 to 12, wherein, under reverse selection conditions, the gene products of
said binding marker genes are beneficial or conditionally essential to the transformed cells.

14. The vector of claim 13, wherein each of the first and second operons confers a phenotype selected
independently, but not identically, from the group consisting of: galT,K+, tetA+, pheS+, argP+, thyA+, crp+,
pyrF+, ptsM+, secA+/malE+/lacZ+, ompA+, btuB+, lamB+, tonA+, cir+, tsx+, aroP+, cysK+ and dctA+.

15. The vector of any one of claims 3 to 14, further comprising a non-deleterious cloning site positioned such
that insertion of a foreign gene at such a site does not inactivate said first or second operons or any genetic
element of said vector necessary for its maintenance in the transformed cell.

16. The selection vector of any one of claims 3 to 15, wherein a target sequence associated with at least one
of said operons is positioned within the RNA polymerase binding site of the promoter of the operon.

17. The selection vector of any one of claims 3 to 16, wherein a target sequence associated with at least one
of said operons is positioned upstream of the -35 region of the promoter of the operon.

18. The selection vector of any one of claims 3 to 17, wherein a target sequence associated with at least one
of said operons is positioned downstream of the -10 region of the promoter of the operon.

19. The selection vector of any one of claims 3 to 18, wherein a target sequence is positioned such that the
5'-most base of the target sequence is transcribed as base +1 or base +2 of the mRNA transcribed under the
direction of the promoter of the operon.

20. The selection vector of claim 10, wherein one of said genetic elements is the origin of replication of
said vector.

21. The vector of any one of claims 3 to 20, further comprising a gene (pdbp) coding on expression for a
potential DNA-binding protein or polypeptide, said gene comprising:

a) a coding region which codes on expression for a polypeptide, each domain of said polypeptide having at
least 50% sequence identity to a known DNA-binding domain; and

b) a promoter operably linked to said coding region to regulate its expression.

22. The vector of any one of claims 3 to 20, wherein said first or second promoter is an inducible or
repressible promoter.

23. The vector of any one of claims 3 to 22, wherein the target DNA sequence comprises 10 to 25 base pairs.

24. The vector of any one of claims 3 to 22, wherein no copy of the target DNA sequence occurs naturally in
said first promoter, said second promoter, the noncoding regions of said first binding marker gene or the
noncoding regions of said second binding marker gene.

25. The vector of claim 1, wherein the selective advantage is the ability to better utilize a particular
nutrient or a reduced dependence on a particular nutrient.

26. The vector of claim 1, wherein the selective advantage is resistance to a substance otherwise toxic to the
recipient cell.

27. The vector of claim 26, wherein the selective advantage is resistance to an antibiotic.

28. A population of vectors according to claim 21, randomly mutated so as to potentially encode any of 2 to 20
predetermined amino acids at predetermined codons in the gene, such that said vectors can collectively express
a plurality of different but sequence-related potential DNA-binding proteins.

29. The population of claim 28, wherein the level of random mutation is such that from 10^ to 10^ different
potential DNA-binding proteins can be expressed.

30. A cell culture comprising a plurality of cells, each cell bearing a selection vector according to any one
of claims 1 to 27, said cell bearing a gene coding on expression for a potential DNA-binding protein, where
said cells can collectively express a plurality of different but sequence-related potential DNA-binding
proteins.

31. A method of obtaining a gene coding on expression for a novel DNA-binding protein or polypeptide that
binds specifically to a predetermined target DNA sequence in double-stranded DNA, comprising the steps of:

(a) providing a cell culture according to claim 30;

(b) causing the cells of such culture to express said potential DNA-binding proteins or polypeptides;

(c) exposing the cells to forward selection conditions to select for cells which express a protein or
polypeptide that binds specifically to one of said target DNA sequences; and

(d) recovering the selected cells bearing a gene coding on expression for such a protein or polypeptide.

32. A method of producing a DNA-binding protein or polypeptide that binds specifically to a predetermined
double-stranded DNA target, which comprises obtaining, by the method of claim 31, a gene which codes on
expression for such a protein or polypeptide, expressing the gene in a suitable host cell, and recovering said
protein or polypeptide.

33. A method of obtaining a protein or polypeptide that can be used to specifically repress a coding or
regulatory element of interest, which comprises obtaining a gene by the method of claim 31, determining the
sequence of at least the DNA-binding domain of the protein or polypeptide, and producing at least the
DNA-binding domain of the protein or polypeptide.

34. The method of claim 31, wherein a gene encoding a known DNA-binding protein selected from the group
consisting of Cro from phage lambda, cI repressor from phage lambda, Cro from phage 434, cI repressor from
phage 434, P22 repressor, E. coli tryptophan repressor, CAP, P22 Arc, P22 Mnt, E. coli lactose repressor,
MAT-a1-alpha2 from yeast, Polyoma Large T antigen, SV40 Large T antigen, Adenovirus E1A, and TFIIIA from
Xenopus laevis is randomly mutated so as to potentially encode any of 2 to 20 predetermined amino acids at
predetermined codons, to obtain genes coding on expression for a plurality of potential target DNA-binding
proteins.

35. The method of claim 31, wherein the DNA-binding protein comprises a plurality of zinc-finger DNA-binding
domains.

36. The method of claim 31, wherein the cells, prior to transformation, are of a GalE-, GalT-, GalK-, TetS
phenotype, the binding marker genes are the tet and galT,K genes, and the forward selection condition is
culture of the cells in a medium containing galactose or fusaric acid or both, or substances that are
metabolized into galactose or fusaric acid or that catalyze their production.

37. The method of claim 31, wherein said pdbp gene is randomly mutated so as to potentially encode any of 2 to
20 predetermined amino acids at predetermined codons, such that at least one randomly mutated codon of said
gene genetically encodes all twenty amino acids and yields, for that codon, an abundance ratio of the least
favored amino acid to the most favored amino acid that is greater than that obtained with an NNN codon, N
representing an equimolar mixture of G, A, T and C.

38. The method of claim 37, wherein, for said codon, the frequencies with which acidic and basic amino acids
are encoded are equal.

39. The method of claim 37, wherein said randomly mutated codon has substantially the following proportions of
bases:
FIG. 4. Delta4 cells containing pKK175-6 are AmpR, TetS, FusR, and GalR.
T2 is rrnBt1; T3 is rrnBt2; P2 is Pamp.

Delta4 cells containing pAA3H are AmpR, TetS, FusR, and GalS.
T2 is rrnBt1; T3 is rrnBt2; P1 is the pBR322 P1 promoter; P2 is Pamp.

FIG. 6. Delta4 cells containing pEP1001 are AmpR, TetS, FusR, and GalS.
P1 is the pBR322 P1 promoter; P2 is Pamp.

FIG. 7. Delta4 cells containing pEP1002 are AmpR, TetS, FusR, and GalS.
P1 is the pBR322 P1 promoter; P2 is Pamp; T1 is the phage fd terminator; T2 is rrnBt1; T3 is rrnBt2.

FIG. 8. Delta4 cells containing pEP1003 are AmpR, TetS, FusR, and GalS.
P2 is Pamp; T1 is the phage fd terminator.

HB101 cells containing pEP1004 are AmpR, TetS, FusR, and GalR.
P2 is Pamp; T1 is the phage fd terminator; T2 is rrnBt1; T3 is rrnBt2.

FIG. 10. Delta4 cells containing pEP1005 are AmpR, TetS, FusR, and GalS.
P2 is Pamp; P3 is Pneo; T1 is the phage fd terminator; T2 is rrnBt1; T3 is rrnBt2.

FIG. 11. Delta4 cells containing pEP1007 are AmpR, TetR, FusS, and GalS.
P2 is Pamp; P3 is Pneo; P4 is PlacUV5; T1 is the phage fd terminator; T2 is rrnBt1; T3 is rrnBt2;
T4 is the trpA terminator.
FIG. 12. Delta4 cells containing pEP1009 are AmpR, TetS, FusR, and GalR in the presence of IPTG, and
AmpR, TetR, FusS, and GalS in the absence of IPTG.
P2 is Pamp; P3 is Pneo; P4 is PlacUV5; T1 is the phage fd terminator; T2 is rrnBt1; T3 is rrnBt2;
T4 is the trpA terminator.